Abstract

In numerous current and future applications ranging from autonomous navigation of 
mobile robots to collision avoidance systems for cars, an imaging system (installed 
on a moving vehicle) takes 2D images of an environment with the aim of finding the 
motion of the vehicle (translational and rotational velocities) as well as the structure 
of the environment (shape). In machine vision, this problem is referred to as the 
general motion vision problem. 

This thesis introduces a direct method called fixation for solving this general
motion vision problem, that is, arbitrary motion relative to an arbitrary environment.
The method is motivated by the desire to avoid feature correspondence and optical
flow; it uses the spatio-temporal brightness gradients of the images directly.
The fixation method results in a linear constraint equation (Fixation Constraint Equa- 
tion) which explicitly expresses the rotational velocity in terms of the translational 
velocity. The combination of this constraint equation with the Brightness-Change 
Constraint Equation (a fundamental equation which relates the motion to the bright- 
ness gradients at any image point) solves the general motion vision problem. 

In contrast to previous direct methods, the fixation method does not impose any 
severe restrictions on the motion or the environment. Moreover, the fixation method 
neither requires tracked images as its input nor uses tracking for obtaining fixated 
images. Instead, it introduces a novel technique called the pixel shifting process to 
construct fixated images for any arbitrary fixation point. This is done entirely in 
software without any need to move the imaging system for tracking. 

This fixation method has been successfully tested in a real-world environment
for the recovery of motion and shape in the general case. The experimental results
are presented, and the implementation issues and techniques are discussed.

Thesis Supervisor: Dr. Berthold K. P. Horn 

Title: Professor of Electrical Engineering and Computer Science 
Introduction



Chapter 1 



One of the principal objects of theoretical research in 
any department of knowledge is to find the point of view 
from which the subject appears in its greatest simplicity. 

-Josiah Willard Gibbs 

A little thought about the role of vision in the tasks that humans perform in their 
everyday life leaves no doubt about its importance. For the past several decades, 
physiologists and psychophysicists have been striving to understand the underlying 
mechanisms of human vision. On a parallel track, computer vision scientists have 
been working on the development of artificial systems for performing different visual 
tasks. 

1.1 Motion Vision 

In many applications, an imaging system (installed on a moving vehicle) takes 2D 
images of the environment. In motion vision, the goal is to find the motion of the
moving vehicle (translational and rotational velocities) as well as the shape (structure
of the environment), using a sequence of time-varying images such as those shown in
fig. 1-1.

Like many other vision problems, motion vision is extremely hard to accomplish. 

The difficulties stem from three major sources: 






Figure 1-1: A sequence of real images where the motion between two images is a 
combination of translation and rotation. 



• Underconstrained: 

Deriving 3D information (motion and shape) from 2D data (images) is a severely
underconstrained problem (i.e., an infinite number of solutions are potentially
consistent with the given data).

• Huge Amount of Data: 

Processing even a single regular-size image (512 x 512 pixels) requires handling
about a quarter of a million pixels' worth of data.

• Noise: Real image data are very noisy. 
1.1.1 Problem statement

The problem which we have addressed in this thesis can be summarized as follows: 

Finding the motion (relative translational and rotational velocities) and
shape (environment structure) from a sequence of two real images by a
direct method (not using either optical flow or feature correspondence)
in the general case (without restricting the motion or shape).

1.2 Previous Work (Main Approaches) 

People have been working on motion vision problems for several decades using three
major techniques: optical flow, feature correspondence, and direct methods.

A survey of previous literature on machine vision is given in [11] and a partial list
of last year's papers in computer vision is compiled in [51]. Some of the current issues in
image flow theory and motion vision are discussed in [88, 4, 55]. Much of the earlier
work on recovering motion has been based on either establishing correspondences
between the images of prominent features (points, lines, contours, and so on) in an
image sequence, the so-called feature correspondence [48, 80, 81, 35, 3], or establishing
the velocity of points over the whole image, commonly referred to as the optical flow
[8, 14, 2].

Each of the main approaches (optical flow, feature correspondence, and direct
methods) is described briefly in this section and an example is given for each case
using the real image sequence in fig. 1-1.

1.2.1 Optical flow 

The computation of the local flow field exploits a constraint equation between the 
local brightness changes and the two components of the optical flow. This only gives 
the components of flow in the direction of the brightness gradient. To compute the 
full flow field, one needs additional constraints such as the heuristic assumption that
the flow field is locally smooth [30, 28]. This leads to an estimated optical flow field
which may not be the same as the true motion field.
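The statement that the brightness constraint fixes only the flow component along the brightness gradient can be illustrated with a minimal Python sketch. The function name `normal_flow` and the sample numbers are illustrative assumptions, not taken from the thesis; the constraint used is the standard form E_t + u E_x + v E_y = 0.

```python
def normal_flow(Ex, Ey, Et, eps=1e-12):
    """Flow component along the brightness gradient.

    From E_t + u*E_x + v*E_y = 0 only the projection of (u, v) onto the
    gradient direction is determined; the component perpendicular to the
    gradient is left free by this single equation.
    """
    g2 = Ex * Ex + Ey * Ey
    if g2 < eps:
        return None  # no gradient: nothing can be measured at this pixel
    scale = -Et / g2
    return (scale * Ex, scale * Ey)


# Hypothetical numbers: true flow (1.0, 0.5), gradient purely along x.
u_true, v_true = 1.0, 0.5
Ex, Ey = 2.0, 0.0
Et = -(u_true * Ex + v_true * Ey)  # makes the constraint hold exactly
recovered = normal_flow(Ex, Ey, Et)
print(recovered)  # (1.0, 0.0): the 0.5 component along y is not recovered
```

In this example the y component of the true flow is invisible to the constraint, which is why an additional assumption such as local smoothness is needed to obtain a full flow field.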

Figure 1-2 shows an optical flow field for the image sequence given in fig. 1-1. 
The size and direction of the apparent velocity at any pixel is shown by an arrow. 
Instead of the original images, such optical flow fields are used as a primary source 
of information in the optical flow techniques. 

The irregular flow vectors on the upper edge of this map are probably due to
noise and to inherent errors in the computations at the image borders.









Figure 1-2: The optical flow map for the given real image sequence. The arrows show 
the magnitude and direction of the apparent motion at each point. 



1.2.2 Feature correspondence 

In general, identifying features here means determining gray-level corners. For images 
of smooth objects, it is difficult to find good features or corners. Furthermore, the 
correspondence problem has to be solved, that is, feature points from consecutive 
frames have to be matched. 

Figure 1-3 shows the edge map for the top image in fig. 1-1. Several correspondence 
methods use such edge maps as the basic source of data instead of the original image. 
Then, they try to find some common features in different edge maps and relate them
together.

Figure 1-3: The edge maps for each of the images in the sequence.



1.2.3 Direct methods 



The use of optical flow or correspondence techniques for solving motion vision prob- 
lems has proven to be rather unreliable and computationally expensive [84, 83, 34]. 

These techniques spend a lot of effort on transforming the original images to the 
optical flow or the edge maps. The assumptions made in these procedures result in 
errors and loss of some useful information which exists in the original images. 

These problems have motivated the investigation of direct methods which use the 
image brightness information directly to recover the motion and shape without any 
need to preprocess the original image. 

Previous work in direct motion vision has used the Brightness-Change Constraint
Equation (BCCE) for rigid body motion [44]

$$E_t + \mathbf{v} \cdot \boldsymbol{\omega} + \frac{\mathbf{s} \cdot \mathbf{t}}{Z} = 0 \tag{1.1}$$

to solve special cases such as known depth [30], pure translation or known rotation
[31], pure rotation [31], and planar world [44]. Chapter 2 describes the details of this
nonlinear equation which relates depth $Z$, translational velocity $\mathbf{t}$, and rotational
velocity $\boldsymbol{\omega}$ together.

All these direct methods are restricted in the types of motion or shape that they 
can handle. Our aim is to solve the motion vision problem in the general case using a 
direct method but without restricting either the motion or the shape to any special 
case. 



1.3 Fixation Approach 

This thesis presents a direct method called fixation for solving the motion vision 
problem in the general case without placing any restrictions on the motion or the 
shape [65, 69, 60]. The fixation method is based on the theoretical proof that for a
sequence of fixated images (a sequence of images with one stationary image point in
them), the 3D rotational velocity $\boldsymbol{\omega}$ can always be explicitly expressed as a
linear function of the 3D translational velocity $\mathbf{t}$. Namely,

$$\boldsymbol{\omega} = \omega_{R_0} \hat{\mathbf{R}}_0 + \frac{1}{\|\mathbf{R}_0\|} (\mathbf{t} \times \hat{\mathbf{R}}_0) \tag{1.2}$$

where $\hat{\mathbf{R}}_0$ is the unit vector along the position vector of the fixation point (a point
in the image plane which stays stationary) and $\omega_{R_0}$ is the component of rotational
velocity about the fixation axis $\hat{\mathbf{R}}_0$.

It should be emphasized that we do not need to know the real fixation point, if
there is any, to take advantage of this fixation constraint equation (FCE), eqn. 1.2. In
fact, our algorithm allows us to choose virtually any point as the fixation point and
obtain a sequence of fixated images [65, 69] by a simple software manipulation of one
of the original images.

The combination of the Fixation Constraint Equation (FCE), eqn. 1.2, and the 
BCCE, eqn. 1.1 offers a solution to the motion vision problem of arbitrary motion 
relative to an arbitrary rigid environment. That is, it allows recovery of the depth
map $Z$, total 3D rotational velocity $\boldsymbol{\omega}$, and 3D translational velocity $\mathbf{t}$ without placing
severe restrictions on the motion or the shape [65, 69].



1.4 Contributions 

The principal contributions of this thesis are summarized as follows.

• Derivation of the Fixation Constraint Equation: 

Deriving a strong constraint equation called the fixation constraint equation (FCE). 
This constraint equation has a solid mathematical foundation. It expresses that for a 
sequence of fixated images, the rotational velocity can always be explicitly expressed 
as a linear function of translational velocity [69, 62, 61]. This equation is general and 
no hidden assumptions were made in its derivation. 

• Obtaining a solution to the general motion problem: 

Introducing a direct method called the fixation method which provides a solution for 
the general motion vision problem and has the following properties [69, 60, 63] : 

- Finds the motion (translational and rotational velocities) and shape (the
environment structure) from two monocular images.

- Does not restrict the motion or shape.

- Does not use either optical flow or feature correspondence. 

- Is computationally simple. 

• Tracking without moving the camera: 

Presenting a novel method called the pixel shifting process for constructing a sequence
of fixated (tracked) images from any arbitrary image sequence [65, 64]. It allows an
arbitrary choice of fixation point, is fully software based, and does not require moving
the camera for tracking.

• Autonomous choice of an optimum fixation patch size: 

Finding a technique for autonomous choice of an optimum fixation patch size which 
results in good estimates for the motion parameters. This technique is based on 
defining a norm called normalized error and has been successfully implemented and 
tested on real images [68, 72, 66]. 

• Autonomous choice of an appropriate fixation point location: 

Some regions of a given image are better suited than others for use as fixation patches.
We have developed a method for autonomous choice of an appropriate fixation point
location [67, 72].

• Rotation axis calibration: 

Introducing a procedure for the calibration of a rotation axis in imaging systems. This 
technique is simple but useful and results in avoiding potential implementation errors 
[70, 72]. 

• Representing image gradients: 

A novel method has been presented for visual representation of the spatio-temporal 
gradients. These intensity gradient maps allow one to visually understand the char- 
acteristics and significance of the brightness gradients [73, 70]. 

• Constructing fixated (tracked) image sequences: 

Using the pixel shifting process and a bilinear interpolation technique we have con- 
structed fixated images from real images [73, 70]. 

• Depth map recovery from two monocular real images: 

We have recovered good depth maps from two monocular real images using the
fixation method [71, 70].

1.5 Thesis Structure

This work comprises three parts: Theory, Implementation, and Appendices.

1.5.1 Part I: Theory 

This part covers the mathematical background of direct methods and the detailed 
theory of fixation. 

• Chapter 2 

We begin with a description of the camera model and coordinate system used in 
this work. Then, the brightness change constraint equation (BCCE) used by direct 
methods is explained. 

• Chapter 3 

This chapter presents the main idea behind our fixation method. It shows how the
Fixation Constraint Equation (FCE) is derived and how it can be combined with the
BCCE in order to solve for the translational velocity $\mathbf{t}$, rotational velocity $\boldsymbol{\omega}$, and the
depth $Z$ at any image point.

• Chapter 4 

In an arbitrary image sequence, a point chosen as the fixation point does not
necessarily stay stationary in the image plane. This chapter introduces the algorithms for
the estimation of the apparent velocity at the fixation point (fixation velocity) which
are required for the construction of a sequence of fixated images. Simultaneously,
these algorithms find an estimate for the component of the rotational velocity along
the fixation axis, $\omega_{R_0}$, which appears in the FCE.

• Chapter 5 

The fixation method requires a sequence of fixated images. This chapter shows how 
a sequence of fixated images can be constructed from an arbitrary image sequence 
using the components of the fixation velocity. 

• Chapter 6

This chapter ends the basic theoretical part of the thesis by giving an overview of the
main modules involved in the fixation method.

1.5.2 Part II: Implementation 

This part presents the experimental results of applying the algorithms given in Part 
I to real image sequences. The implementation issues are described along with tech- 
niques for dealing with some practical problems. 

• Chapter 7 

The spatio-temporal brightness gradients of the images are the primary source of 
data used in our fixation method. This chapter introduces a novel technique for 
representing the gradients of real images. Such representations allow us to have a 
better insight about the characteristics and significance of gradients. 

• Chapter 8 

The experimental results in this chapter show that the estimated values for the
components of the fixation velocity and $\omega_{R_0}$ depend heavily on the size of the image
patch used in the computation. It will be shown that, depending on the image and
the fixation point location, there are some patch sizes which result in good estimates
for the desired motion parameters.

• Chapter 9 

This chapter presents a novel and reliable technique for autonomous choice of an 
optimum fixation patch size that results in good estimations for the motion parameters 
from real noisy images. 

• Chapter 10 

The fixation method does not place any restrictions on the choice of the fixation 
point and virtually any point can be chosen as the fixation point. However, some 
considerations should be taken into account when choosing a fixation point. For
example, choosing a point at the center of a patch which has uniform brightness is not
good because the motion is not detectable. This chapter introduces an autonomous
technique for choosing an appropriate fixation point.

• Chapter 11 

Not only in our fixation technique but also in many other methods there is a substan- 
tial need for a sequence of fixated (tracked) images. This chapter introduces a novel 
method (pixel shifting process) for constructing a sequence of fixated images from an 
arbitrary image sequence using the components of the fixation velocity. 

• Chapter 12 

Using the estimated motion parameters and the constructed sequence of fixated im- 
ages, this chapter describes the issues involved in recovering depth maps. Detailed 
techniques are presented for overcoming practical problems such as noise and inherent 
image deficiencies. 

• Chapter 13 

Camera calibration is usually an unavoidable requirement for working with real im- 
ages. This chapter discusses some of the calibration issues that we faced in this 
work. 

• Chapter 14 

We conclude this work by giving a summary of the fixation method, results, features, 
assumptions, shortcomings, relation to other works, and finally some thoughts on the 
possible future extensions. 

1.5.3 Part III: Supplements 

Some of the relevant theoretical proofs and formulations are summarized in this part. 

• Appendix A 

Provides a detailed derivation of the BCCE. 

• Appendix B 

Presents the formulations for computing the spatio-temporal gradients. 

• Appendix C 

Describes a technique for computing the depth at the fixation point, $Z_0$.

Direct Methods



Chapter 2 



Images are usually obtained from a regular electronic camera where the projection 
is perspective. In this chapter, we first describe the camera model and the coordi- 
nate system used in this work. Then, a mathematical background of the BCCE is 
presented. 

2.1 Modeling and Coordinate System 

As shown in fig. 2-1, the coordinate system is attached to the camera so that its origin 
is located at the projection center. 

The image plane is where the environment image is projected to. In an electronic 
camera, a CCD (Charge Coupled Device) plays the role of the image plane. The CCD 
is an electronic light-sensitive plane. It consists of a tessellation of small rectangular 
or square photo-sensitive cells which are called pixels. Each pixel of the CCD is 
electronically charged depending on the number of the photons it receives. Thus, the 
charge level of each pixel is a representation of the brightness at the corresponding 
point in the image plane. By reading out the charge levels of all pixels and converting
them appropriately, the image can be written to a file or displayed on a screen.

The image plane in our coordinate system is parallel to the X-Y plane and is
located at a distance equal to the focal length from it. The optical axis Z pierces the
image plane at a point which is called the principal point. Any environment point $\mathbf{R}$
is projected to an image point $\mathbf{r}$ in this coordinate system.

Figure 2-1: The coordinate system is attached to the camera and the projection is
perspective.



2.2 Basic Definitions 

A viewer-centered coordinate system, adopted from Longuet-Higgins & Prazdny [36],
is very common in direct motion vision. Figure 2-2 depicts the coordinate
system under consideration.

In such a coordinate system, a world point

$$\mathbf{R} = (X \;\; Y \;\; Z)^T \tag{2.1}$$

is imaged at

$$\mathbf{r} = (x \;\; y \;\; 1)^T. \tag{2.2}$$



That is, the image plane has the equation $Z = 1$ or, in other words, the focal length
$f$ is 1. The origin is at the projection center and the Z-axis runs along the optical
axis. The X and Y axes are parallel to the x and y axes of the image plane. Image
coordinates are measured relative to the principal point, the point $(0 \;\; 0 \;\; 1)^T$ where the
optical axis pierces the image plane. The position vectors $\mathbf{r}$ and $\mathbf{R}$ are related by the
perspective projection equation

$$\mathbf{r} = (x \;\; y \;\; 1)^T = \left( \frac{X}{Z} \;\; \frac{Y}{Z} \;\; \frac{Z}{Z} \right)^T = \frac{\mathbf{R}}{\mathbf{R} \cdot \hat{\mathbf{z}}} \tag{2.3}$$

where $\hat{\mathbf{z}}$ denotes the unit vector along the Z-axis and $\mathbf{R} \cdot \hat{\mathbf{z}} = Z$.

Figure 2-2: When the viewer moves with translational velocity $\mathbf{t} = (U \;\; V \;\; W)^T$
and rotational velocity $\boldsymbol{\omega} = (A \;\; B \;\; C)^T$, any environment point $\mathbf{R}$ has the velocity
$\mathbf{R}_t$ from the observer's point of view.
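The perspective projection of eqn. 2.3 can be sketched in a few lines of Python. This is a minimal illustration under the stated convention of unit focal length; the function name `project` and the sample point are assumptions made for the example.

```python
def project(R):
    """Perspective projection r = R / (R . z) for focal length 1 (eqn. 2.3)."""
    X, Y, Z = R
    if Z <= 0.0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return (X / Z, Y / Z, 1.0)


# Hypothetical environment point and a second point on the same ray.
r_near = project((1.0, 2.0, 4.0))
r_far = project((2.0, 4.0, 8.0))
print(r_near, r_far)  # both are (0.25, 0.5, 1.0)
```

Points along the same ray through the projection center map to the same image point, which is one way to see why depth cannot be read off a single image.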

When the observer moves with instantaneous translational velocity $\mathbf{t} = (U \;\; V \;\; W)^T$
and instantaneous rotational velocity $\boldsymbol{\omega} = (A \;\; B \;\; C)^T$ relative to an environment, then
the time derivative of the position vector of a point in the environment, $\mathbf{R}$, relative
to the observer can be written as

$$\mathbf{R}_t = -\mathbf{t} - \boldsymbol{\omega} \times \mathbf{R}. \tag{2.4}$$



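The motion of the world point $\mathbf{R}$ results in the motion of its corresponding image
point $\mathbf{r}$. It can be shown that the motion field in the image plane is obtained by
differentiating eqn. 2.3 with respect to time as in [44]

$$\mathbf{r}_t = \frac{d}{dt} \left( \frac{\mathbf{R}}{\mathbf{R} \cdot \hat{\mathbf{z}}} \right) = \frac{\hat{\mathbf{z}} \times (\mathbf{R}_t \times \mathbf{r})}{\mathbf{R} \cdot \hat{\mathbf{z}}}. \tag{2.5}$$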



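Substituting for $\mathbf{R}$, $\mathbf{r}$ and $\mathbf{R}_t$ from equations 2.1, 2.2, and 2.4 into eqn. 2.5 gives
[36, 14]

$$\mathbf{r}_t = \begin{pmatrix} x_t \\ y_t \\ 0 \end{pmatrix} = \begin{pmatrix} \dfrac{-U + xW}{Z} + Axy - B(x^2 + 1) + Cy \\[2mm] \dfrac{-V + yW}{Z} - Bxy + A(y^2 + 1) - Cx \\[2mm] 0 \end{pmatrix}. \tag{2.6}$$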



This result is just the parallax equations of photogrammetry that occur in the incre- 
mental adjustment of relative orientation [23, 42]. It shows how, given the environ- 
ment motion, the motion field can be calculated for every image point. 
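As a sanity check on eqn. 2.6, the following Python sketch evaluates the closed-form motion field and compares it with a finite-difference estimate obtained by moving the point according to eqn. 2.4 and reprojecting. The function names, the step size `dt`, and the numerical values are assumptions chosen for illustration.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def motion_field(x, y, Z, t, w):
    """Image velocity (x_t, y_t) from eqn. 2.6, focal length 1."""
    U, V, W = t
    A, B, C = w
    xt = (-U + x * W) / Z + A * x * y - B * (x * x + 1.0) + C * y
    yt = (-V + y * W) / Z - B * x * y + A * (y * y + 1.0) - C * x
    return xt, yt


def motion_field_numeric(R, t, w, dt=1e-6):
    """Finite-difference estimate: one small step of R_t = -t - w x R."""
    wxR = cross(w, R)
    Rn = tuple(R[i] + dt * (-t[i] - wxR[i]) for i in range(3))
    x0, y0 = R[0] / R[2], R[1] / R[2]
    x1, y1 = Rn[0] / Rn[2], Rn[1] / Rn[2]
    return (x1 - x0) / dt, (y1 - y0) / dt


# Hypothetical point and motion parameters.
R = (0.5, -0.3, 2.0)
t = (0.1, 0.2, 0.3)
w = (0.01, -0.02, 0.03)
analytic = motion_field(R[0] / R[2], R[1] / R[2], R[2], t, w)
numeric = motion_field_numeric(R, t, w)
print(analytic, numeric)
```

The two estimates agree up to the discretization error of the finite step, which supports the signs of the individual terms in eqn. 2.6.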



2.3 The Brightness Change Constraint Equation 



Image brightness changes are primarily due to the relative motion between an en- 
vironment and an observer provided that the surfaces of the objects have sufficient 
texture and the lighting condition varies slowly enough both spatially and with time. 
In such cases (which may occur in practical applications), brightness changes due to 
the variations in the surface orientation and illumination can be neglected. Conse- 
quently, we may assume that the brightness of a small patch on a surface in the scene 
does not change during motion. As shown in appendix A, when the motion is small
the expansion of the total derivative of brightness $E$ leads to

$$\frac{dE}{dt} = E_t + x_t E_x + y_t E_y = 0 \tag{2.7}$$

known as the Brightness Change Constraint Equation (BCCE), where $(E_x, E_y)$ and
$E_t$ are the spatial and temporal gradients of the image brightness at any given pixel
[30, 54, 29].

Note that eqn. 2.7 does not hold for the special case that the viewer and the 
light source are stationary and the environment moves relative to them because the 
brightness of a surface patch does not remain constant in this case. 
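To make eqn. 2.7 concrete, the sketch below estimates the three gradients from two small frames and evaluates the constraint for a known image shift. The thesis gives its own gradient formulas in appendix B, which is not reproduced here; the 2 x 2 x 2 cube averaging used below is a common finite-difference scheme substituted as an assumption, and the function name `gradients` and the synthetic frames are hypothetical.

```python
def gradients(E1, E2, r, c):
    """Estimate (E_x, E_y, E_t) from the 2x2x2 cube with corner (r, c).

    E1 and E2 are consecutive frames as lists of rows; x runs along
    columns and y along rows. Each gradient averages four differences.
    """
    ex = 0.25 * (E1[r][c + 1] - E1[r][c] + E1[r + 1][c + 1] - E1[r + 1][c]
                 + E2[r][c + 1] - E2[r][c] + E2[r + 1][c + 1] - E2[r + 1][c])
    ey = 0.25 * (E1[r + 1][c] - E1[r][c] + E1[r + 1][c + 1] - E1[r][c + 1]
                 + E2[r + 1][c] - E2[r][c] + E2[r + 1][c + 1] - E2[r][c + 1])
    et = 0.25 * (E2[r][c] - E1[r][c] + E2[r][c + 1] - E1[r][c + 1]
                 + E2[r + 1][c] - E1[r + 1][c]
                 + E2[r + 1][c + 1] - E1[r + 1][c + 1])
    return ex, ey, et


# Synthetic linear brightness pattern 2x + 3y, shifted by (u, v) = (1, -0.5).
u, v = 1.0, -0.5
E1 = [[2.0 * c + 3.0 * r for c in range(3)] for r in range(3)]
E2 = [[2.0 * (c - u) + 3.0 * (r - v) for c in range(3)] for r in range(3)]
Ex, Ey, Et = gradients(E1, E2, 0, 0)
residual = Et + u * Ex + v * Ey  # left-hand side of eqn. 2.7
print(Ex, Ey, Et, residual)
```

For this noise-free linear pattern the residual vanishes; on real images it generally does not, which is consistent with the assumptions listed later in this chapter.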



2.3.1 Rigid body motion 

In rigid body motion, there is only one relative motion between the observer and the
environment. For this case, we can substitute for $x_t$ and $y_t$ from eqn. 2.6 into eqn. 2.7,
to obtain the brightness-change constraint equation for the rigid body motion [44] as

$$E_t + \mathbf{v} \cdot \boldsymbol{\omega} + \frac{\mathbf{s} \cdot \mathbf{t}}{Z} = 0. \tag{2.8}$$



This equation is nonlinear in terms of the unknowns: rotation $\boldsymbol{\omega}$, translation $\mathbf{t}$, and depth
$Z$. The auxiliary vectors $\mathbf{s}$ and $\mathbf{v}$ are known at any pixel $(x, y)$ and are defined as

$$\mathbf{s} = \begin{pmatrix} -E_x \\ -E_y \\ xE_x + yE_y \end{pmatrix} \tag{2.9}$$

and

$$\mathbf{v} = \begin{pmatrix} E_y + y(xE_x + yE_y) \\ -E_x - x(xE_x + yE_y) \\ yE_x - xE_y \end{pmatrix}. \tag{2.10}$$

Footnote (on the BCCE): To account for smooth variations in the image brightness due to other
factors such as shading, spatial and temporal illumination changes, and variations in reflectance
properties, the BCCE can be extended to

$$E_t + x_t E_x + y_t E_y = m_t E + c_t$$

where in general $m_t$ and $c_t$ are time and position dependent [21, 45]. Cornelius & Kanade [17] also
propose a method which allows gradual changes in brightness. These extensions are not discussed here.

Since $\mathbf{s} \cdot \mathbf{r} = 0$, $\mathbf{v} \cdot \mathbf{r} = 0$ and $\mathbf{s} \cdot \mathbf{v} = 0$, the vectors $\mathbf{r}$, $\mathbf{s}$, and $\mathbf{v}$ form an orthogonal triad;
see fig. 2-3. The vectors $\mathbf{s}$ and $\mathbf{v}$ represent inherent properties of the image. Also it
can be shown that $\mathbf{v} = \mathbf{r} \times \mathbf{s}$. The vector $\mathbf{s}$ indicates the direction in which translation
of a given magnitude will contribute maximally to the temporal brightness change of
a given picture cell. The vector $\mathbf{v}$ plays a similar role for rotation.
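The orthogonality relations and the identity v = r x s can be checked numerically. The sketch below builds s and v from eqns. 2.9 and 2.10 for one pixel; the helper names and the gradient values are assumptions made for the example.

```python
def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(p * q for p, q in zip(a, b))


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def s_vector(x, y, Ex, Ey):
    """Auxiliary vector s of eqn. 2.9."""
    return (-Ex, -Ey, x * Ex + y * Ey)


def v_vector(x, y, Ex, Ey):
    """Auxiliary vector v of eqn. 2.10."""
    q = x * Ex + y * Ey
    return (Ey + y * q, -Ex - x * q, y * Ex - x * Ey)


# Hypothetical pixel position and spatial gradients.
x, y, Ex, Ey = 0.2, -0.1, 3.0, -1.5
r = (x, y, 1.0)
s = s_vector(x, y, Ex, Ey)
v = v_vector(x, y, Ex, Ey)
print(dot(s, r), dot(v, r), dot(s, v))  # all close to zero
print(v, cross(r, s))                   # the two vectors coincide
```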







Figure 2-3: At any pixel, vectors r (pixel position), s, and v form an orthogonal triad. 
Also v = r x s. 



The BCCE for rigid body motion, eqn. 2.8, does not change if we scale both $Z$ and $\mathbf{t}$ by the same factor.
Consequently, we can determine only the direction of translational velocity and the
relative depth of points in the scene. This ambiguity is known as the scale-factor
ambiguity in motion vision.
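The scale-factor ambiguity can be seen directly by evaluating the left-hand side of the rigid-body BCCE, E_t + v . w + (s . t)/Z, before and after scaling t and Z together. The sketch below is illustrative only; the function name `bcce_residual` and all numerical values are assumptions.

```python
def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(p * q for p, q in zip(a, b))


def bcce_residual(Et, s, v, t, w, Z):
    """Left-hand side of the rigid-body BCCE: E_t + v.w + (s.t)/Z."""
    return Et + dot(v, w) + dot(s, t) / Z


# Hypothetical per-pixel quantities and motion parameters.
Et = 0.4
s = (-3.0, 1.5, 0.75)
v = (-1.575, -3.15, 0.0)
t = (0.1, 0.2, 0.3)
w = (0.01, -0.02, 0.03)
Z = 2.0

k = 7.5  # arbitrary common scale factor
base = bcce_residual(Et, s, v, t, w, Z)
scaled = bcce_residual(Et, s, v, tuple(k * c for c in t), w, k * Z)
print(base, scaled)  # equal up to rounding
```

Because the translation enters only through the ratio (s . t)/Z, no brightness measurement can distinguish a fast motion past a distant scene from a slow motion past a near one.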

Equation 2.7 is obtained under the following assumptions: 

• No noise, 

• Sufficient surface texture, 




• Slow spatio-temporal variations in lighting, 

• Small motions between frames. 

In real images, violation of any of these conditions may cause eqn. 2.7 not to
hold at any single pixel. However, later we will show how this equation can be used in
a least squares method for recovery of shape and motion from real image sequences.

Fixation Formulation



Chapter 3 



Our common visual experience suggests that fixation may play an important role 
in the analysis of moving objects. When we want to understand the motion of an 
object, we do not keep our eyes and head stationary in front of the moving object. 
Instead, our head and/or eyes follow the moving object, in order to keep the image 
of a point of interest stationary in the retina. There are also some formal studies 
that support such observations [6, 7, 9]. In this computer vision work, the fixation is 
defined as: 

Given two subsequent images, 1st and 2nd initial images, and an arbitrary 
point in the 1st initial image, find a new image, a 2nd fixated image, such 
that the image of the selected point in the new image is located at its 
original position as in the 1st initial image. 

This definition of fixation is shown schematically in fig. 3-1. If we choose point 1 
in the 1st initial image as the fixation point, its image in the 2nd initial image may 
move to a new location such as 2. In chapter 5, we introduce a simple technique for 
converting the 2nd initial image in order to bring image point 2 to the same physical 
location as point 1. This process will construct the 2nd fixated image and form a 
sequence of images fixated at point 1. 
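The idea of bringing image point 2 back to the location of point 1 in software can be sketched as a resampling of the 2nd initial image. The thesis's own pixel shifting process is described in chapter 5 and is not reproduced here; the version below is a minimal sketch that assumes a uniform shift of the whole image combined with bilinear interpolation, and the names `bilinear` and `shift_image` as well as the test image are hypothetical.

```python
import math


def bilinear(img, x, y):
    """Sample img (list of rows) at real-valued column x and row y.

    Coordinates outside the image are clamped to the border.
    """
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1.0 - fx) * img[y0][x0] + fx * img[y0][x1]
    bottom = (1.0 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1.0 - fy) * top + fy * bottom


def shift_image(img, dx, dy):
    """Return an image whose pixel (c, r) samples img at (c + dx, r + dy)."""
    h, w = len(img), len(img[0])
    return [[bilinear(img, c + dx, r + dy) for c in range(w)]
            for r in range(h)]


# Hypothetical 2nd initial image; point 1 at (c1, r1), point 2 at (c2, r2).
img2 = [[10.0 * r + c for c in range(4)] for r in range(3)]
c1, r1 = 1, 1
c2, r2 = 2.5, 1.0
fixated = shift_image(img2, c2 - c1, r2 - r1)
print(fixated[r1][c1], bilinear(img2, c2, r2))  # same brightness value
```

After the shift, the brightness that was at point 2 in the 2nd initial image appears at the pixel of point 1, so the pair (1st initial image, shifted image) is fixated at point 1 under the uniform-shift assumption.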

As shown in fig. 3-2, we refer to this arbitrarily selected image point as the fixation
point, $\mathbf{r}_0$, and to its corresponding point on the object as the interest point, $\mathbf{R}_0$.

[Figure 3-1 panels: 1st Initial Image; 2nd Initial Image; 2nd Fixated Image. The 1st initial image and the 2nd fixated image form a sequence of fixated images (at point 1).]



Figure 3-1: A schematic interpretation of fixation point and fixated image sequence. 



3.1 Derivation of the General Fixation Constraint Equation

For a sequence of two fixated images, at the fixation point $\mathbf{r}_0$ we should have

$$\mathbf{r}_{0t} = \mathbf{0} \tag{3.1}$$



[Figure 3-2 labels: Fixation Point; Fixation Axis.]




Figure 3-2: In the fixation method, the image of the interest point, the fixation point, 
is kept stationary in the image plane despite the relative motion between the camera 
and the environment. 



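where $\mathbf{r}_{0t}$ is the time derivative of the fixation point vector and, similar to eqn. 2.5, it
can be written as

$$\mathbf{r}_{0t} = \frac{\hat{\mathbf{z}} \times (\mathbf{R}_{0t} \times \mathbf{r}_0)}{\mathbf{R}_0 \cdot \hat{\mathbf{z}}}. \tag{3.2}$$

$\mathbf{R}_{0t}$ is the time derivative of the interest point vector. Combination of equations 3.1
and 3.2 shows that for fixation we need to have

$$\hat{\mathbf{z}} \times (\mathbf{R}_{0t} \times \mathbf{r}_0) = \mathbf{0}. \tag{3.3}$$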



In other words, we want to find out when $\mathbf{R}_{0t} \times \mathbf{r}_0$ is zero or parallel to $\hat{\mathbf{z}}$. For $\mathbf{R}_{0t} \times \mathbf{r}_0$
to be parallel to $\hat{\mathbf{z}}$, we should have $\mathbf{r}_0$ perpendicular to $\hat{\mathbf{z}}$, which is not possible with a
finite field of view, so only $\mathbf{R}_{0t} \times \mathbf{r}_0 = \mathbf{0}$ applies. Consequently, considering that $\mathbf{R}_0$
and $\mathbf{r}_0$ have the same direction, eqn. 3.3 is simplified as

$$\mathbf{R}_{0t} \times \mathbf{R}_0 = \mathbf{0}. \tag{3.4}$$

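Now substituting for $\mathbf{R}_{0t} = -\mathbf{t} - \boldsymbol{\omega} \times \mathbf{R}_0$, eqn. 2.4, into eqn. 3.4 gives

$$(\boldsymbol{\omega} \times \mathbf{R}_0) \times \mathbf{R}_0 + \mathbf{t} \times \mathbf{R}_0 = \mathbf{0}. \tag{3.5}$$

Expansion of eqn. 3.5 by using $(\mathbf{a} \times \mathbf{b}) \times \mathbf{c} = (\mathbf{c} \cdot \mathbf{a})\mathbf{b} - (\mathbf{c} \cdot \mathbf{b})\mathbf{a}$ results in

$$(\mathbf{R}_0 \cdot \boldsymbol{\omega})\mathbf{R}_0 - (\mathbf{R}_0 \cdot \mathbf{R}_0)\boldsymbol{\omega} + \mathbf{t} \times \mathbf{R}_0 = \mathbf{0}. \tag{3.6}$$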

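As long as the translational velocity $\mathbf{t}$ is neither zero nor parallel to the interest
point vector $\mathbf{R}_0$, then any vector, including $\boldsymbol{\omega}$, can be expressed in terms of the triad
of vectors $\mathbf{R}_0$, $\mathbf{t} \times \mathbf{R}_0$ and $\mathbf{t}$. So we can write $\boldsymbol{\omega}$ in its general form as

$$\boldsymbol{\omega} = \alpha \mathbf{R}_0 + \beta (\mathbf{t} \times \mathbf{R}_0) + \gamma \mathbf{t} \tag{3.7}$$

where $\alpha$, $\beta$ and $\gamma$ are parameters to be determined. Later in this section we will
consider the special cases where $\mathbf{t}$ is zero or parallel to $\mathbf{R}_0$ by defining $\boldsymbol{\omega}$ based on
another triad of vectors.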

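Substituting for $\boldsymbol{\omega}$ from eqn. 3.7 into eqn. 3.6 gives

$$[1 - \beta(\mathbf{R}_0 \cdot \mathbf{R}_0)](\mathbf{t} \times \mathbf{R}_0) + \gamma(\mathbf{R}_0 \cdot \mathbf{t})\mathbf{R}_0 - \gamma(\mathbf{R}_0 \cdot \mathbf{R}_0)\mathbf{t} = \mathbf{0}. \tag{3.8}$$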

Now, we should find the parameters $\beta$ and $\gamma$ such that eqn. 3.8 holds without placing
any restrictions on either $\mathbf{R}_0$ or $\mathbf{t}$. We start by finding the dot product of eqn. 3.8
with $\mathbf{t} \times \mathbf{R}_0$ which results in

$$[1 - \beta(\mathbf{R}_0 \cdot \mathbf{R}_0)] \, \|\mathbf{t} \times \mathbf{R}_0\|^2 = 0. \tag{3.9}$$

Equation 3.9 will hold without restricting either $\mathbf{R}_0$ or $\mathbf{t}$ if

$$\beta = \frac{1}{\|\mathbf{R}_0\|^2}. \tag{3.10}$$

Another possibility for satisfying eqn. 3.9 is to have $\|\mathbf{t} \times \mathbf{R}_0\| = 0$, which implies that
either $\mathbf{t}$ or $\mathbf{R}_0$ is zero, or $\mathbf{t}$ is parallel to $\mathbf{R}_0$. But $\mathbf{R}_0$ cannot be zero and also we
assumed that here $\mathbf{t}$ is neither zero nor parallel to $\mathbf{R}_0$. As a result, $\|\mathbf{t} \times \mathbf{R}_0\|$ cannot
be zero.

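Similarly the dot product of eqn. 3.8 with $\mathbf{t}$ gives

$$\gamma(\mathbf{R}_0 \cdot \mathbf{t})(\mathbf{R}_0 \cdot \mathbf{t}) - \gamma(\mathbf{R}_0 \cdot \mathbf{R}_0)(\mathbf{t} \cdot \mathbf{t}) = 0. \tag{3.11}$$

Knowing that $(\mathbf{a} \times \mathbf{b}) \cdot (\mathbf{c} \times \mathbf{d}) = (\mathbf{c} \cdot \mathbf{a})(\mathbf{b} \cdot \mathbf{d}) - (\mathbf{b} \cdot \mathbf{c})(\mathbf{d} \cdot \mathbf{a})$, eqn. 3.11 can be simplified
as

$$\gamma \, \|\mathbf{t} \times \mathbf{R}_0\|^2 = 0. \tag{3.12}$$

We discussed that $\|\mathbf{t} \times \mathbf{R}_0\|$ cannot be zero here, so eqn. 3.12 is satisfied only if $\gamma$ is
zero

$$\gamma = 0. \tag{3.13}$$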

Substituting for $\beta$ from eqn. 3.10 and $\gamma$ from eqn. 3.13 into eqn. 3.7 gives

$$\boldsymbol{\omega} = \alpha \mathbf{R}_0 + \frac{1}{\|\mathbf{R}_0\|^2} (\mathbf{t} \times \mathbf{R}_0) \tag{3.14}$$

where $\alpha$ is still unknown. This means that the component of the rotational velocity
along $\mathbf{R}_0$ cannot be determined by the fixation formulation. Physically this makes
sense because the rotational velocity along $\mathbf{R}_0$, denoted by $\omega_{R_0}$, does not move the
fixation point. This observation leads us to find $\omega_{R_0}$ in a separate step before using
the fixation formulation results. Derivation of $\omega_{R_0}$ will be shown in chapter 4.

As a result, the fixation constraint equation (FCE) is written as

$$\boldsymbol{\omega} = \omega_{R_0} \hat{\mathbf{R}}_0 + \frac{1}{\|\mathbf{R}_0\|} (\mathbf{t} \times \hat{\mathbf{R}}_0) \tag{3.15}$$

where $\mathbf{t}$ is the translational velocity and $\hat{\mathbf{R}}_0 = \hat{\mathbf{r}}_0$ is the unit vector along the position
vector of an arbitrary fixation point, an arbitrary point in the image chosen for
fixation. Equation 3.15 shows that after fixation, the rotational velocity $\boldsymbol{\omega}$ can be
explicitly expressed as a linear function of the translational velocity $\mathbf{t}$.
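The fixation constraint can be verified numerically: if the rotational velocity is built from eqn. 3.15, the velocity of the interest point given by eqn. 2.4 should be parallel to its position vector, so that eqn. 3.4 holds and the fixation point does not move in the image. The following Python sketch performs this check; the helper names and the chosen values of t, R_0, and the axial rotation component are assumptions made for the example.

```python
import math


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(p * q for p, q in zip(a, b))


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    """Unit vector along a."""
    n = math.sqrt(dot(a, a))
    return tuple(c / n for c in a)


def fce_rotation(t, R0, w_R0):
    """Rotational velocity from eqn. 3.15 for interest point R0."""
    n = math.sqrt(dot(R0, R0))
    R0_hat = unit(R0)
    txR = cross(t, R0_hat)
    return tuple(w_R0 * R0_hat[i] + txR[i] / n for i in range(3))


def interest_point_velocity(t, w, R0):
    """Velocity of the interest point, R_0t = -t - w x R0 (eqn. 2.4)."""
    wxR = cross(w, R0)
    return tuple(-t[i] - wxR[i] for i in range(3))


# Hypothetical translation, interest point, and axial rotation component.
t = (0.3, -0.2, 0.5)
R0 = (0.4, 0.1, 2.0)
w = fce_rotation(t, R0, 0.05)
R0t = interest_point_velocity(t, w, R0)
residual = cross(R0t, R0)  # eqn. 3.4 requires this to vanish
print(residual, dot(w, unit(R0)))
```

The residual is zero up to rounding for any choice of the axial component, in line with the observation that rotation about the fixation axis does not move the fixation point.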



3.1.1 Derivation of special fixation constraint equation 

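When the translational velocity $\mathbf{t}$ is zero or parallel to the interest point vector $\mathbf{R}_0$,
eqn. 3.6 is simplified as

$$(\mathbf{R}_0 \cdot \boldsymbol{\omega})\mathbf{R}_0 - (\mathbf{R}_0 \cdot \mathbf{R}_0)\boldsymbol{\omega} = \mathbf{0}. \tag{3.16}$$

This time, $\boldsymbol{\omega}$ is defined based on the triad consisting of vectors $\mathbf{R}_0$, $\hat{\mathbf{x}}$, and $\hat{\mathbf{x}} \times \mathbf{R}_0$ as

$$\boldsymbol{\omega} = l \mathbf{R}_0 + m(\hat{\mathbf{x}} \times \mathbf{R}_0) + n \hat{\mathbf{x}} \tag{3.17}$$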

where /, m, and n are parameters to be determined. Here we assume that R is not 
parallel to x. This is a reasonable assumption because otherwise we should at least 
have a field of view of 180° to be able to choose an awkward interest point along the 
X-axis, which results in a fixation point at an infinite distance from the principal 
point and near the border of an infinite image plane. 
Substituting for u from eqn. 3.17 into eqn. 3.16 gives 

n(R • x)R - m(R • R )(x x R ) - n(R • R )x = 0. (3.18) 
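The dot product of eqn. 3.18 with $(\hat{\mathbf{x}} \times \mathbf{R}_o)$ results in

$$-m(\mathbf{R}_o \cdot \mathbf{R}_o)\,\|\hat{\mathbf{x}} \times \mathbf{R}_o\|^2 = 0. \tag{3.19}$$

Considering that $\mathbf{R}_o$ cannot be either zero or parallel to $\hat{\mathbf{x}}$, eqn. 3.19 is satisfied only
if $m$ is zero:

$$m = 0. \tag{3.20}$$

Substituting for $m$ into eqn. 3.18 and finding its dot product with $\hat{\mathbf{x}}$ results in

$$n(\mathbf{R}_o \cdot \hat{\mathbf{x}})(\mathbf{R}_o \cdot \hat{\mathbf{x}}) - n(\mathbf{R}_o \cdot \mathbf{R}_o)(\hat{\mathbf{x}} \cdot \hat{\mathbf{x}}) = 0. \tag{3.21}$$

Using $(\mathbf{a} \times \mathbf{b}) \cdot (\mathbf{c} \times \mathbf{d}) = (\mathbf{c} \cdot \mathbf{a})(\mathbf{b} \cdot \mathbf{d}) - (\mathbf{b} \cdot \mathbf{c})(\mathbf{d} \cdot \mathbf{a})$, eqn. 3.21 can be written, up to an overall sign, as

$$n\,\|\hat{\mathbf{x}} \times \mathbf{R}_o\|^2 = 0. \tag{3.22}$$

Again $\mathbf{R}_o$ cannot be either zero or parallel to $\hat{\mathbf{x}}$. As a result, eqn. 3.22 will hold for
arbitrary $\mathbf{R}_o$ only if $n = 0$. Substituting for $n$ and $m$ into eqn. 3.17 gives

$$\boldsymbol{\omega} = l\,\mathbf{R}_o \tag{3.23}$$

where $l$ is still unknown. We can substitute $\omega_{\hat{R}_o}\hat{\mathbf{R}}_o$ for $l\,\mathbf{R}_o$. The procedure for
computing the component of rotational velocity along the fixation axis, $\omega_{\hat{R}_o}$, will be
given in chapter 4. Consequently, for the special cases we obtain the special fixation
constraint equation (SFCE) as

$$\boldsymbol{\omega} = \omega_{\hat{R}_o}\hat{\mathbf{R}}_o \tag{3.24}$$

which means that when the translational velocity $\mathbf{t}$ is zero or parallel to $\mathbf{R}_o$ then the
corresponding rotational velocity may only have a component along $\mathbf{R}_o$.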

This procedure for deriving the SFCE, eqn. 3.24, is not essentially different from
what we did for deriving the FCE, eqn. 3.15. In fact, eqn. 3.24 is a special case of
eqn. 3.15. But we did not directly derive eqn. 3.24 from eqn. 3.15 because eqn. 3.15
was derived based on the assumption that $\mathbf{t}$ is neither zero nor parallel to $\mathbf{R}_o$. As a
result, for implementation it is enough to use the FCE, eqn. 3.15, without knowing
whether the present condition is a special case or not.

3.1.2 Interpretation of the FCE 

We gave a detailed mathematical proof for derivation of the fixation constraint
equation (FCE), eqn. 3.15. This constraint equation indicates that for a sequence of
fixated images, the rotational velocity $\boldsymbol{\omega}$ can always be expressed as a linear function
of the translational velocity $\mathbf{t}$. This section examines whether the FCE makes sense
physically.

The first term, $\omega_{\hat{R}_o}\hat{\mathbf{R}}_o$, says that $\boldsymbol{\omega}$ can have an unrestricted component along the
fixation axis $\mathbf{R}_o$. This is correct because such a component does not cause the fixation
point to move and as a result the fixation is not violated.

The second term of the FCE, $\frac{1}{\|\mathbf{R}_o\|^2}(\mathbf{t} \times \mathbf{R}_o)$, conveys two points:

• The translation $\mathbf{t}$ can have an arbitrary component along the fixation axis $\mathbf{R}_o$
because such a component does not move the fixation point in the image plane.

• The rotational velocity $\boldsymbol{\omega}$ should have a component perpendicular to $\mathbf{R}_o$ and be
large enough to compensate for the component of the translational velocity $\mathbf{t}$ which
is perpendicular to $\mathbf{R}_o$ in order to keep the fixation point stationary in the image plane.

We can conclude that the FCE has a meaningful physical interpretation. 

3.2 Solving the General Direct Motion Vision Problem

At this stage, we assume that a sequence of two fixated images has been constructed.
In other words, we have made the fixation point stationary in the image plane. This
can be done first by finding the fixation velocity, the apparent velocity at the fixation
point in the 1st image, as shown in chapter 4. Then the pixel shifting process explained
in chapter 5 can be used for constructing a new image, the 2nd fixated image, in which
the image of the interest point is positioned at the same point as in the 1st initial
image.

We start by studying the general case where the translational velocity $\mathbf{t}$ is neither
zero nor parallel to the interest point vector $\mathbf{R}_o$. The special cases of $\mathbf{t}$ will be
discussed later.

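Substituting for $\boldsymbol{\omega}$ from the fixation constraint equation 3.15 into the
brightness-change constraint equation 2.8 gives

$$E_t + \omega_{\hat{R}_o}\,\mathbf{v} \cdot \hat{\mathbf{R}}_o + \frac{1}{\|\mathbf{R}_o\|^2}[\mathbf{v} \cdot (\mathbf{t} \times \mathbf{R}_o)] + \frac{1}{Z}(\mathbf{s} \cdot \mathbf{t}) = 0. \tag{3.25}$$

Knowing that $\mathbf{a} \cdot (\mathbf{b} \times \mathbf{c}) = (\mathbf{a} \times \mathbf{b}) \cdot \mathbf{c}$ and doing some manipulations on eqn. 3.25
results in

$$E_t' + \left[\frac{1}{Z}\,\mathbf{s} - \frac{1}{\|\mathbf{R}_o\|^2}(\mathbf{v} \times \mathbf{R}_o)\right] \cdot \mathbf{t} = 0 \tag{3.26}$$

where $E_t'$ is a notation for $E_t + \omega_{\hat{R}_o}\,\mathbf{v} \cdot \hat{\mathbf{R}}_o$, which is computable at any pixel assuming
that $\omega_{\hat{R}_o}$ is known. In chapter 4, we will introduce a technique which finds a good
estimate for $\omega_{\hat{R}_o}$.

In general, eqn. 3.26 can be solved numerically for $\mathbf{t}$ and $Z$ using images of any
size and with any field of view. For a small patch around the fixation point, called a
fixation patch, eqn. 3.26 can be simplified (see footnote 1) as

$$E_t' + \left(\frac{1}{Z} - \frac{1}{Z_o}\right)(\mathbf{s} \cdot \mathbf{t}) \approx 0. \tag{3.27}$$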



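Footnote 1: Considering that $\|\mathbf{R}_o\| = Z_o\|\mathbf{r}_o\|$ and $\mathbf{v} = \mathbf{r} \times \mathbf{s}$, the term $\frac{1}{\|\mathbf{R}_o\|^2}(\mathbf{v} \times \mathbf{R}_o)$ from eqn. 3.26, let's
call it $\mathbf{K}$, can be expanded as

$$\mathbf{K} = \frac{1}{Z_o\|\mathbf{r}_o\|^2}\,(\mathbf{r} \times \mathbf{s}) \times \mathbf{r}_o.$$

Further expansion of $\mathbf{K}$ by using the relation $(\mathbf{a} \times \mathbf{b}) \times \mathbf{c} = (\mathbf{c} \cdot \mathbf{a})\mathbf{b} - (\mathbf{c} \cdot \mathbf{b})\mathbf{a}$ results in

$$\mathbf{K} = \frac{1}{Z_o\|\mathbf{r}_o\|^2}\left[(\mathbf{r}_o \cdot \mathbf{r})\,\mathbf{s} - (\mathbf{r}_o \cdot \mathbf{s})\,\mathbf{r}\right].$$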
As described in the footnote, the approximation made here is based on a purely
geometric assumption and is not related to the image properties. For example, we
are not making any assumptions about the depth topology. We simply assume that
motion parameters can be obtained using a small fixation patch. As shown in fig. 3-3,
the smallness of such a patch translates into the smallness of an angle a. Numerous
experimental results in chapter 9 show that indeed good motion estimates are obtained
using optimum patch sizes with a field of view small enough to justify this assumption.

Figure 3-3: A schematic interpretation of fixation point and fixated image sequence.
(Labels in the figure: fixation patch, fixation point.)

In analogy to the pure translation case of [31], we can find the translational velocity
$\mathbf{t}$. Equation 3.27 shows that $1/(\frac{1}{Z} - \frac{1}{Z_o}) = -\frac{\mathbf{s} \cdot \mathbf{t}}{E_t'}$. At the points where $E_t'$ is very small,
even a small error in computing $\mathbf{t}$ will result in a large error in $1/(\frac{1}{Z} - \frac{1}{Z_o})$, which
translates into a large error in the estimation of depth $Z$. Considering this fact, the
true translational velocity $\mathbf{t}$ can be found from eqn. 3.27 by minimizing



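$$J = \iint \left(\frac{1}{\frac{1}{Z} - \frac{1}{Z_o}}\right)^2 dx\,dy = \iint \left(\frac{\mathbf{s} \cdot \mathbf{t}}{E_t'}\right)^2 dx\,dy \tag{3.28}$$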



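Footnote 1 (continued): Since $\mathbf{s}$ is perpendicular to $\mathbf{r}$, it is clear that at the fixation point, where $\mathbf{r} = \mathbf{r}_o$ and $\mathbf{s} = \mathbf{s}_o$, $\mathbf{K} = \frac{1}{Z_o}\mathbf{s}_o$, and for the points near the
fixation point $\mathbf{K} \approx \frac{1}{Z_o}\mathbf{s}$.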
with respect to $\mathbf{t}$. In other words, we are looking for the true motion $\mathbf{t}$ which minimizes
the sum of squares of $\frac{\mathbf{s} \cdot \mathbf{t}}{E_t'}$ over the fixation patch. Note that this minimization does
not force $Z$ towards $Z_o$ because at $Z = Z_o$ the value of $J$ becomes infinite.

We also put the $\|\mathbf{t}\| = 1$ constraint on this minimization problem to avoid the
trivial solution $\mathbf{t} = 0$. This is a valid constraint on $\mathbf{t}$ because due to the scale factor
ambiguity we can only find the direction of $\mathbf{t}$. This constraint on $\mathbf{t}$ can be written as

$$\mathbf{t}^T\mathbf{t} = 1. \tag{3.29}$$

Moreover we can rewrite $J$ as

$$J = \mathbf{t}^T\mathbf{M}\,\mathbf{t} \tag{3.30}$$

where $\mathbf{M}$ is a fully computable 3 x 3 symmetric matrix

$$\mathbf{M} = \iint \left(\frac{1}{E_t'}\right)^2 \mathbf{s}\,\mathbf{s}^T\,dx\,dy. \tag{3.31}$$

Minimizing $J$ in eqn. 3.30 under the constraint eqn. 3.29 is an ordinary calculus
constrained minimization problem which can be solved by finding the stationary points of

$$f(\mathbf{t}, \lambda) = \mathbf{t}^T\mathbf{M}\,\mathbf{t} + \lambda(1 - \mathbf{t}^T\mathbf{t}) \tag{3.32}$$

with respect to $\mathbf{t}$ and the Lagrange multiplier $\lambda$. Then, we will obtain

$$\frac{\partial f}{\partial \mathbf{t}} = 2\mathbf{M}\mathbf{t} - 2\lambda\mathbf{t} = 0 \tag{3.33}$$

which is simplified as

$$\mathbf{M}\mathbf{t} = \lambda\mathbf{t}. \tag{3.34}$$

Equation 3.34 is an eigenvalue problem where $\lambda$ is an eigenvalue of the known matrix
$\mathbf{M}$ and $\mathbf{t}$ is the corresponding eigenvector. The eigenvalues of $\mathbf{M}$ are real and
nonnegative because $\mathbf{M}$ is a real symmetric positive semidefinite matrix. Substituting for $\mathbf{M}\mathbf{t}$ from
eqn. 3.34 into eqn. 3.32 gives $f = \lambda$, which implies that under the given constraint,
$\mathbf{t}^T\mathbf{M}\,\mathbf{t}$ is minimized when the smallest of the three real and nonnegative eigenvalues is
used for computing the eigenvector $\mathbf{t}$.
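The eigenvector step can be sketched in a few lines. The code below is a hedged illustration, not the thesis implementation: it replaces the integral of eqn. 3.31 by a sum over the pixels of the fixation patch, and the function name `translation_direction`, the array layout, and the threshold `eps` (used to skip pixels where $E_t'$ is nearly zero) are assumptions made for this example.

```python
import numpy as np

def translation_direction(s, Et_prime, eps=1e-6):
    """Unit translation direction minimizing t^T M t (eqns. 3.30 to 3.34).

    s        : (N, 3) array, one s vector per pixel of the fixation patch
    Et_prime : (N,) array of E_t' values at the same pixels
    eps      : hypothetical threshold; pixels with |E_t'| <= eps are skipped
    """
    s = np.asarray(s, dtype=float)
    Et_prime = np.asarray(Et_prime, dtype=float)
    keep = np.abs(Et_prime) > eps
    weights = 1.0 / Et_prime[keep] ** 2
    # Discrete version of eqn. 3.31: M = sum (1/E_t')^2 s s^T
    M = (s[keep] * weights[:, None]).T @ s[keep]
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 0], eigvals[0], M
```

The returned vector has unit length, and its sign is arbitrary, because an eigenvector and its negative both satisfy eqn. 3.34. Choosing between the two directions needs an extra condition, such as requiring positive depth values.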

It is shown that the fixation method can be used for solving the motion vision
problem in its general case. The translational velocity $\mathbf{t}$ is obtained from eqn. 3.34
by using the smallest eigenvalue and computing its corresponding eigenvector. Then
we can use eqn. 3.26 for finding the depth map, a depth at each image point, as

$$Z = \frac{\mathbf{s} \cdot \mathbf{t}}{\frac{1}{\|\mathbf{R}_o\|^2}(\mathbf{v} \times \mathbf{R}_o) \cdot \mathbf{t} - E_t'}. \tag{3.35}$$

Then, eqn. 3.15 gives the partial rotational velocity $\boldsymbol{\omega}$

$$\boldsymbol{\omega} = \omega_{\hat{R}_o}\hat{\mathbf{R}}_o + \frac{1}{\|\mathbf{R}_o\|^2}(\mathbf{t} \times \mathbf{R}_o) \tag{3.36}$$

where $\|\mathbf{R}_o\| = Z_o\|\mathbf{r}_o\|$ and $Z_o$ is the depth at the fixation point. Appendix C
introduces a technique for estimating $Z_o$.

The total rotational velocity of the observer relative to the environment is obtained
by adding $\boldsymbol{\omega}$ to the equivalent rotational velocity $\boldsymbol{\Omega}$ given in chapter 5. It can be seen
that for the general case, the fixation formulation lets us find the shape and motion
by choosing virtually any point as the fixation point.
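The depth recovery of eqn. 3.35 is a per-pixel division once $\mathbf{t}$ is known. The sketch below assumes that the vectors $\mathbf{s}$ and $\mathbf{v}$ and the values $E_t'$ are already available as arrays; the function name `depth_map` and the array layout are choices of this example. The usage part builds synthetic data that satisfy eqn. 3.26 exactly and checks that the depths are recovered.

```python
import numpy as np

def depth_map(s, v, Et_prime, t, R_o):
    """Depth at each pixel from eqn. 3.35.

    s, v     : (N, 3) arrays of the s and v vectors
    Et_prime : (N,) array of E_t'
    t        : (3,) translational velocity (unit vector)
    R_o      : (3,) position vector of the fixation point
    """
    s = np.asarray(s, dtype=float)
    v = np.asarray(v, dtype=float)
    R_o = np.asarray(R_o, dtype=float)
    K = np.cross(v, R_o) / np.dot(R_o, R_o)   # (1/||R_o||^2)(v x R_o)
    denom = K @ t - np.asarray(Et_prime, dtype=float)
    # Pixels where denom is close to zero give unreliable depth values.
    return (s @ t) / denom

# Synthetic check: choose depths, generate E_t' from eqn. 3.26, recover Z.
rng = np.random.default_rng(1)
s = rng.normal(size=(50, 3))
v = rng.normal(size=(50, 3))
t = np.array([0.6, 0.0, 0.8])
R_o = np.array([0.1, -0.2, 3.0])
Z_true = rng.uniform(1.0, 5.0, size=50)
K = np.cross(v, R_o) / np.dot(R_o, R_o)
Et_prime = K @ t - (s @ t) / Z_true
Z = depth_map(s, v, Et_prime, t, R_o)
```

Because only the direction of $\mathbf{t}$ is recoverable, the depths obtained this way carry the same unknown scale factor.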

3.2.1 Special cases: $\mathbf{t}$ is zero or parallel to $\mathbf{R}_o$

When the translational velocity $\mathbf{t}$ is zero, we showed that the partial rotational
velocity $\boldsymbol{\omega}$ has only a component about the fixation axis $\mathbf{R}_o$, eqn. 3.24. The technique
for computing this component of rotational velocity is given in chapter 4. For this
special case, pure rotation, there are also methods for finding the total rotational
velocity using the initial unfixated images [31]. In the case of $\mathbf{t} = 0$, we basically
cannot obtain any estimate for the depth $Z$.

For the other special case that $\mathbf{t}$ is parallel to $\mathbf{R}_o$, we substitute for $\boldsymbol{\omega}$ from eqn. 3.24
into the BCCE eqn. 2.8 to obtain

$$E_t' + \frac{1}{Z}(\mathbf{s} \cdot \mathbf{t}) = 0 \tag{3.37}$$

where $E_t'$ is again a notation for the computable term $E_t + \omega_{\hat{R}_o}\,\mathbf{v} \cdot \hat{\mathbf{R}}_o$. Because no
approximation is involved in deriving eqn. 3.37, an exact closed form solution exists
for $\mathbf{t}$ and $Z$ without any restriction on the field of view or the size of fixation patch.
This exact solution for finding $\mathbf{t}$ and $Z$ is the same as the solution given in the general
case, starting from eqn. 3.28, except that $J$ is defined as $\iint Z^2\,dx\,dy$ for this special
case.
Computing the Fixation Velocity
and Rotational Component $\omega_{\hat{R}_o}$

Chapter 4 



In an arbitrary image sequence, a point chosen as the fixation point does not
necessarily stay stationary in the image plane. We use the term fixation velocity to
refer to the apparent velocity at the fixation point in the 1st initial image. As shown
in fig. 4-1, the x and y components of the fixation velocity are represented by $u_o$ and
$v_o$ respectively.

The fixation method requires a sequence of two fixated images in which the fixation
point stays stationary, $\mathbf{r}_{ot} = 0$. A fixated image sequence can be obtained by first
finding $u_o$ and $v_o$, and then using these components to construct a new image, the
fixated 2nd image. The technique for the construction of the fixated 2nd image (pixel
shifting process) is explained in chapter 5.

We also saw that the component of the rotational velocity along the fixation axis,
$\omega_{\hat{R}_o}$, cannot be obtained from the fixation formulation because this component does
not move the fixation point.

In this chapter, we will introduce an algorithm for obtaining not only the rotation
$\omega_{\hat{R}_o}$ but also the components of the fixation velocity, $u_o$ and $v_o$.
Figure 4-1: In general, for any point chosen as the fixation point $(x_o, y_o)$, there is an associated
apparent velocity (fixation velocity), and a rotational component along the fixation
axis, $\omega_{\hat{R}_o}$. The components of fixation velocity are shown by $(u_o, v_o)$.



4.1 Algorithm 

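The motion field velocity due to the rotational velocity component $\omega_{\hat{R}_o}$ is given by
$-(\boldsymbol{\omega}_{\hat{R}_o} \times \mathbf{r}) = -\omega_{\hat{R}_o}(\hat{\mathbf{R}}_o \times \mathbf{r}) = -\frac{\omega_{\hat{R}_o}}{\|\mathbf{r}_o\|}(\mathbf{r}_o \times \mathbf{r})$, where $\hat{\mathbf{R}}_o = \hat{\mathbf{r}}_o$ is the unit vector
along the fixation axis $\mathbf{r}_o$. Considering a small patch around the fixation point, and
substituting $\mathbf{r} = (x\ y\ 1)^T$ and $\mathbf{r}_o = (x_o\ y_o\ 1)^T$, the components of the total motion
field velocity due to the fixation velocity and $\omega_{\hat{R}_o}$ are given by

$$\begin{aligned}
x_t &= u_o - \frac{\omega_{\hat{R}_o}}{\|\mathbf{r}_o\|}\,\hat{\mathbf{x}} \cdot (\mathbf{r}_o \times \mathbf{r}) = u_o + \omega_{R_o}(y - y_o) \\
y_t &= v_o - \frac{\omega_{\hat{R}_o}}{\|\mathbf{r}_o\|}\,\hat{\mathbf{y}} \cdot (\mathbf{r}_o \times \mathbf{r}) = v_o - \omega_{R_o}(x - x_o)
\end{aligned} \tag{4.1}$$

where $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$ are the unit vectors along the x and y axes and $\omega_{R_o}$ is a notation for
$\frac{\omega_{\hat{R}_o}}{\|\mathbf{r}_o\|}$. Substituting for $x_t$ and $y_t$ from the above equations into the BCCE, eqn. 2.7,
gives

$$[u_o + \omega_{R_o}(y - y_o)]E_x + [v_o - \omega_{R_o}(x - x_o)]E_y + E_t = 0. \tag{4.2}$$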
Due to noise, eqn. 4.2 does not necessarily hold for any point $(x, y)$. Thus, we try
to find $u_o$, $v_o$ and $\omega_{R_o}$ by minimizing the sum of squares of errors over the fixation
patch. In other words we want to minimize

$$\iint \left[(u_o + \omega_{R_o}(y - y_o))E_x + (v_o - \omega_{R_o}(x - x_o))E_y + E_t\right]^2 dx\,dy \tag{4.3}$$

with respect to $u_o$, $v_o$ and $\omega_{R_o}$. This results in a system of three linear equations
that can be solved for the three unknowns

$$\mathbf{A}\begin{pmatrix} u_o \\ v_o \\ \omega_{R_o} \end{pmatrix} = \mathbf{C}. \tag{4.4}$$
Matrix $\mathbf{A}$ is symmetric and its elements are given by

$$\begin{aligned}
a_{12} &= \iint E_x E_y\,dx\,dy \\
a_{13} &= \iint E_x[E_x(y - y_o) - E_y(x - x_o)]\,dx\,dy \\
a_{23} &= \iint E_y[E_x(y - y_o) - E_y(x - x_o)]\,dx\,dy \\
a_{11} &= \iint E_x^2\,dx\,dy \\
a_{22} &= \iint E_y^2\,dx\,dy \\
a_{33} &= \iint [E_x(y - y_o) - E_y(x - x_o)]^2\,dx\,dy
\end{aligned} \tag{4.5}$$

and the components of vector $\mathbf{C}$ are as follows:

$$\begin{aligned}
c_1 &= -\iint E_t E_x\,dx\,dy \\
c_2 &= -\iint E_t E_y\,dx\,dy \\
c_3 &= -\iint E_t[E_x(y - y_o) - E_y(x - x_o)]\,dx\,dy.
\end{aligned} \tag{4.6}$$

Considering that the fixation point coordinates $x_o$ and $y_o$ are known, the sets of
equations 4.5 and 4.6 show that the elements of matrix $\mathbf{A}$ and the components of
vector $\mathbf{C}$ are fully computable.
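The least-squares system of eqns. 4.4 to 4.6 can be assembled compactly. The following is a minimal sketch under stated assumptions: the integrals are replaced by sums over the patch pixels, the coordinates `x`, `y` are image coordinates in the same units as `x0`, `y0`, and the function name `fixation_velocity` and the rank test used to detect a degenerate patch are choices of this example rather than details from the thesis.

```python
import numpy as np

def fixation_velocity(Ex, Ey, Et, x, y, x0, y0):
    """Solve A [u_o, v_o, w]^T = C (eqn. 4.4), where w is the scaled
    rotational component about the fixation axis.

    Ex, Ey, Et : brightness gradients over the fixation patch
    x, y       : image coordinates of the same pixels
    x0, y0     : coordinates of the fixation point
    """
    Ex, Ey, Et, x, y = (np.ravel(np.asarray(a, dtype=float))
                        for a in (Ex, Ey, Et, x, y))
    g = Ex * (y - y0) - Ey * (x - x0)        # coefficient of w in eqn. 4.2
    B = np.stack([Ex, Ey, g], axis=1)        # one row of coefficients per pixel
    A = B.T @ B                              # elements a_ij of eqn. 4.5
    C = -B.T @ Et                            # components c_i of eqn. 4.6
    if np.linalg.matrix_rank(A) < 3:
        raise ValueError("degenerate patch: matrix A is singular")
    u0, v0, w = np.linalg.solve(A, C)
    return u0, v0, w
```

Stacking the per-pixel coefficients in `B` makes the symmetry of $\mathbf{A}$ explicit, since $\mathbf{A} = \mathbf{B}^T\mathbf{B}$ and $\mathbf{C} = -\mathbf{B}^T E_t$ are the normal equations of the minimization in eqn. 4.3.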
4.2 Discussion

When the spatio-temporal gradients are zero, matrix $\mathbf{A}$ is singular (not invertible) because all of its
elements are zero. As a result, we will not be able to compute the motion components
in such a case. Chapter 10 explains how to avoid this by an autonomous choice of
an appropriate fixation point that is not located in a patch with uniform brightness.
Furthermore, for implementation we make sure that the determinant of matrix $\mathbf{A}$ is
nonzero before advancing into the computations.

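In the special case where the fixation point is at the principal point, $x_o = y_o = 0$,
elements of matrix $\mathbf{A}$ are simplified as

$$\begin{aligned}
a_{12} &= \iint E_x E_y\,dx\,dy \\
a_{13} &= \iint E_x(yE_x - xE_y)\,dx\,dy \\
a_{23} &= \iint E_y(yE_x - xE_y)\,dx\,dy \\
a_{11} &= \iint E_x^2\,dx\,dy \\
a_{22} &= \iint E_y^2\,dx\,dy \\
a_{33} &= \iint (yE_x - xE_y)^2\,dx\,dy
\end{aligned} \tag{4.7}$$

and components of vector $\mathbf{C}$ are given as follows

$$\begin{aligned}
c_1 &= -\iint E_t E_x\,dx\,dy \\
c_2 &= -\iint E_t E_y\,dx\,dy \\
c_3 &= -\iint E_t(yE_x - xE_y)\,dx\,dy.
\end{aligned} \tag{4.8}$$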



After finding $\omega_{R_o}$, we can easily compute $\omega_{\hat{R}_o} = \omega_{R_o}\sqrt{x_o^2 + y_o^2 + 1}$. Clearly, when
the fixation point is at the principal point, $\omega_{\hat{R}_o}$ becomes equal to $\omega_{R_o}$.

The algorithm given in this chapter has been successfully implemented on real
images and good estimates have been obtained for the fixation velocity components
and $\omega_{\hat{R}_o}$. Chapter 8 describes the implementation results.



Constructing a Sequence of 
Fixated Images 

Chapter 5 



The fixation method requires a sequence of two images in which the fixation point 
is kept stationary. However, the input can be an arbitrary sequence of two images 
that we shall call the 1st initial and 2nd initial images. The 1st initial image is used 
directly as the 1st fixated image but we need to find a 2nd fixated image using the 
2nd initial image. 

Physical rotation of the camera relative to the observer base is a hardware solution 
to this problem which is basically a tracking problem. Considering that in general 
the interest point has a motion relative to the observer, the 2nd fixated image cannot 
be obtained in one step. As a result, a feedback control loop is required for the 
camera rotation system to compensate for the errors resulting from the new position 
of the fixation point. This tracking approach is to be avoided not only because of the 
potential errors involved but also because of concern about real time applications. 

In this chapter, we will show how a 2nd fixated image can be constructed by a 
purely software technique, the pixel shifting process. It involves applying an imaginary 
rotation to the vision system and determining the corresponding transformation which 
affects the 2nd initial image. 



5.1 Equivalent Rotational Velocity 

If point 1 is chosen as the fixation point in the 1st initial image, then in general its 
corresponding image point in the 2nd initial image moves to a new location such as 
point 2; see fig. 5-1. 




Figure 5-1: An imaginary rotation opposite to the equivalent rotational velocity, $-\boldsymbol{\Omega}$,
is applied to the vision system to bring point 2 to point 1. This rotation transforms
the 2nd initial image into the 2nd fixated image.



Determining the location of point 2 is equivalent to the estimation of the fixation 
velocity. Chapter 4 introduced a technique for the estimation of the fixation velocity. 
The experimental results in chapters 8 and 9 will also show that the fixation velocity 
can be estimated reliably even from real and noisy images. As a result, it is assumed 
here that the fixation velocity has been already computed from eqn. 4.4. 

There are infinitely many combinations of translations and rotations which can be
applied to the vision system or camera to bring the image point at 2 to the location 1.
Among all these combinations, we choose to accomplish the task by a pure rotation.
To find the desired rotation, we first introduce an equivalent rotational velocity,
$\boldsymbol{\Omega} = (\Omega_x, \Omega_y, \Omega_z)$, as a rotation which can result in the same fixation velocity $(u_o, v_o)$
at the fixation point $(x_o, y_o)$. According to eqn. 2.6, the components of $\boldsymbol{\Omega}$ must satisfy
the following set of equations

$$\begin{aligned}
u_o &= x_o y_o\,\Omega_x - (x_o^2 + 1)\,\Omega_y + y_o\,\Omega_z \\
v_o &= (y_o^2 + 1)\,\Omega_x - x_o y_o\,\Omega_y - x_o\,\Omega_z.
\end{aligned} \tag{5.1}$$

There are also infinitely many configurations of $\boldsymbol{\Omega}$ that satisfy the system of equations in 5.1.
However, we choose the only one that does not introduce any new rotational velocity
along the fixation axis $\mathbf{r}_o$. Mathematically it is equivalent to having $\boldsymbol{\Omega} \cdot \mathbf{r}_o = 0$, which
results in an extra constraint on the components of $\boldsymbol{\Omega}$,

$$x_o\,\Omega_x + y_o\,\Omega_y + \Omega_z = 0. \tag{5.2}$$

This constraint guarantees that the value of $\omega_{\hat{R}_o}$ obtained by applying the system of
equations 4.4 to the two initial images is also valid for the fixated images. As a result,
no adjustment in $\omega_{\hat{R}_o}$ is needed before using it in equations 3.35 and 3.36, which must
be applied to a sequence of fixated images.

Considering that the fixation velocity $(u_o, v_o)$ and the fixation point coordinates
$x_o$ and $y_o$ are known here, the equivalent rotational velocity $\boldsymbol{\Omega}$ is obtained by solving
the combination of three linear equations in 5.1 and 5.2. For example, in the case
that the fixation point is at the principal point, $x_o = y_o = 0$, the equivalent rotational
velocity becomes

$$\boldsymbol{\Omega} = (v_o, -u_o, 0). \tag{5.3}$$

However, it should be emphasized that the fixation point is not restricted to the principal
point and virtually any point can be chosen as the fixation point.
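For a general fixation point, eqns. 5.1 and 5.2 form a 3 x 3 linear system. The sketch below, with the illustrative function name `equivalent_rotation`, solves it directly. Expanding the determinant of the coefficient matrix gives $(x_o^2 + y_o^2 + 1)^2$, which is never zero, so the system always has a unique solution.

```python
import numpy as np

def equivalent_rotation(u0, v0, x0, y0):
    """Equivalent rotational velocity (Omega_x, Omega_y, Omega_z) from
    eqns. 5.1 and 5.2 for fixation velocity (u0, v0) at point (x0, y0)."""
    A = np.array([
        [x0 * y0,       -(x0 ** 2 + 1.0),  y0],   # first line of eqn. 5.1
        [y0 ** 2 + 1.0, -x0 * y0,         -x0],   # second line of eqn. 5.1
        [x0,             y0,               1.0],  # eqn. 5.2: Omega . r_o = 0
    ])
    b = np.array([u0, v0, 0.0])
    # det(A) = (x0^2 + y0^2 + 1)^2 >= 1, so the solve cannot fail
    return np.linalg.solve(A, b)

# At the principal point this reduces to eqn. 5.3: Omega = (v_o, -u_o, 0).
Omega = equivalent_rotation(0.02, -0.01, 0.0, 0.0)
```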



5.2 Constructing the 2nd Fixated Image 

After obtaining the equivalent rotational velocity $\boldsymbol{\Omega}$, the task of constructing the
2nd fixated image is equivalent to finding the transformation experienced by the 2nd
initial image when the imaginary rotation $-\boldsymbol{\Omega}$ is applied to the vision system.

Considering eqn. 5.1, the following set of equations gives the components of the
corresponding shifting vector $(u, v)$ for any pixel $(x, y)$ of the 2nd initial image

$$\begin{aligned}
u &= -xy\,\Omega_x + (x^2 + 1)\,\Omega_y - y\,\Omega_z \\
v &= -(y^2 + 1)\,\Omega_x + xy\,\Omega_y + x\,\Omega_z.
\end{aligned} \tag{5.4}$$

Here $\Omega_x$, $\Omega_y$ and $\Omega_z$ are known values. As a result, the shifting vector $(u, v)$ can be
obtained for every pixel of the 2nd initial image.

Figure 5-2: The pixel shifting process for constructing the fixated 2nd image from the
2nd initial image.

Figure 5-2 shows the process of constructing the 2nd fixated image using the 2nd
initial image, called the pixel shifting process. The brightness at pixel $(x, y)$ of the 2nd
fixated image is the same as the brightness at the corresponding point $(x - Tu, y - Tv)$
in the 2nd initial image, where $T$ is the time interval between two initial images. In
general, a computed original point is not located at the center of a pixel in the 2nd
initial image. As a result, its brightness cannot be read directly from the image file
and should be computed by averaging, bilinear interpolation or bicubic interpolation
of the brightnesses at its neighboring pixels.
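The pixel shifting process can be sketched as a resampling of the 2nd initial image. The code below is an illustration under several assumptions that are not stated in the text: image coordinates are obtained from pixel indices through a focal length `f_pix` expressed in pixels and a principal point `(cx, cy)`, columns run along x and rows along y, bilinear interpolation is used, and source points falling outside the image are clamped to the border. The function name `pixel_shift` is hypothetical.

```python
import numpy as np

def pixel_shift(img2, Omega, f_pix, cx, cy, T=1.0):
    """Construct the 2nd fixated image from the 2nd initial image (eqn. 5.4).

    img2   : 2D array, the 2nd initial image (at least 2 x 2 pixels)
    Omega  : (Omega_x, Omega_y, Omega_z), equivalent rotational velocity
    f_pix  : focal length in pixels (assumed pixel-to-image-coordinate scale)
    cx, cy : principal point in pixel coordinates (column, row)
    T      : time interval between the two initial images
    """
    img2 = np.asarray(img2, dtype=float)
    H, W = img2.shape
    rows, cols = np.mgrid[0:H, 0:W]
    x = (cols - cx) / f_pix
    y = (rows - cy) / f_pix
    Ox, Oy, Oz = Omega
    # Shifting vector of eqn. 5.4 for every pixel
    u = -x * y * Ox + (x ** 2 + 1.0) * Oy - y * Oz
    v = -(y ** 2 + 1.0) * Ox + x * y * Oy + x * Oz
    # Source point (x - T u, y - T v), converted back to pixel units
    src_c = cx + f_pix * (x - T * u)
    src_r = cy + f_pix * (y - T * v)
    # Bilinear interpolation with border clamping
    c0 = np.clip(np.floor(src_c).astype(int), 0, W - 2)
    r0 = np.clip(np.floor(src_r).astype(int), 0, H - 2)
    a = np.clip(src_c - c0, 0.0, 1.0)
    b = np.clip(src_r - r0, 0.0, 1.0)
    return ((1 - a) * (1 - b) * img2[r0, c0] + a * (1 - b) * img2[r0, c0 + 1]
            + (1 - a) * b * img2[r0 + 1, c0] + a * b * img2[r0 + 1, c0 + 1])
```

With a zero equivalent rotation the image is returned unchanged. With the fixation point at the principal point and $\boldsymbol{\Omega} = (0, k, 0)$, the pixel at the principal point takes its brightness from the point `f_pix * T * k` pixels to its left, which matches a fixation velocity of $u_o = -k$ in eqn. 5.1.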

It should be clear by now that we neither require the fixated images to be pro- 
vided in advance nor do we use mechanical tracking for obtaining the fixated images. 
Construction of the 2nd fixated image is based on the pixel shifting process. This is 
done entirely in software and no tracking is involved in this technique. In chapter 11,
we will show the results of implementing this purely software based technique for
constructing a sequence of fixated images for several real image sequences.
An Overview of the Fixation
Method 

Chapter 6 



The algorithms and formulations presented in the previous chapters show how to 
solve directly for the motion and shape in the general case. In contrast to previous 
work done in the area of motion vision, our technique is general and does not put 
any severe restrictions on the motion or the environment. More importantly, the 
fixation method uses neither optical flow nor feature correspondence. Instead, image 
information such as temporal and spatial brightness gradients are used directly. This 
method neither requires tracked images as input nor uses tracking for obtaining fixated 
images. Instead, it introduces a pixel shifting process for constructing fixated images 
at any arbitrary fixation point. This process is done entirely in software without 
moving the camera for tracking. 

In the previous chapters, we gave the theory underlying the fixation method in 
detail. This chapter presents a summary of the main steps involved in the fixation 
method. 

6.1 Main Modules 

Figure 6-1 shows a block diagram of the ideas behind our fixation based motion 
vision system. Referring to this figure, the fixation method can be implemented in 
the following steps: 

Figure 6-1: The modules of the fixation based general motion vision system.

• Step 1: Finding the fixation velocity components $(u_o, v_o)$ and the component of
rotational velocity along $\mathbf{R}_o$, $\omega_{\hat{R}_o}$, by applying the system of eqn. 4.4 to the brightness
gradients from two initial images.

• Step 2: Knowing the fixation velocity components $(u_o, v_o)$, the 2nd fixated image
is constructed by the pixel shifting process explained in chapter 5. This is done entirely
in software without any need to move the camera for tracking. This step also results
in the estimation of the equivalent rotational velocity $\boldsymbol{\Omega}$.

• Step 3: Knowing $\omega_{\hat{R}_o}$, and using the fixation constraint equation 3.15, the 1st
initial image, and the 2nd fixated image, the method presented in chapter 3 can be
used for recovering the translational velocity $\mathbf{t}$, the partial rotational velocity $\boldsymbol{\omega}$, and
the depth $Z$ at all image points.

• Step 4: The total rotational velocity $\boldsymbol{\omega}_{tot}$ is obtained simply by adding the
equivalent rotational velocity $\boldsymbol{\Omega}$, from equations 5.1 and 5.2, to the partial rotational velocity
$\boldsymbol{\omega}$ from eqn. 3.15.

In the following chapters, we apply our fixation based motion vision system to 
the real world environment to recover motion and shape in the general case. At 
every step, we discuss the implementation issues and introduce practical techniques 
for dealing with them. 



Spatial and Temporal Brightness 
Gradients 

Chapter 7 



Brightness gradients are the primary source of information for direct method al- 
gorithms. Appendix B describes the formulations for obtaining spatial brightness 
gradients E x and E y , and the temporal brightness gradient E t from a sequence of two 
time varying images. 

This chapter applies those formulations to two real image sequences to obtain the 
corresponding brightness gradients. Then, we will introduce a technique for the visual 
representation of the brightness gradients and finally, we will study those representa- 
tions to explain the significance and characteristics of brightness gradients. 

7.1 Visual Representation 

Two successive frames of the landscape image sequence (taken at the Imaging
Laboratory of Carnegie Mellon University) are shown in fig. 7-1. These are 8-bit images
but the two least significant bits are usually too noisy to be reliable.

The true motion between these frames is a combination of translation and rotation. 
The real rotation is 0.3 deg about the optical axis Z and the real translation is 2 mm 
along the horizontal axis X. 

Using the formulation in appendix B, we can compute the brightness gradients. 
The corresponding spatial and temporal brightness gradients for the landscape image 
sequence are shown in figures 7-2 and 7-3, respectively. 
Figure 7-4 shows another image sequence (cup image sequence) used in the
experiments. The motion between these successive frames is a 3D translation of (2.5, 0, 4) mm.
The spatial and temporal gradients for the cup image sequence are shown in figures 7-5
and 7-6, respectively.

In these maps, larger gradient values are shown brighter. Such gradient maps 
suggest a way of visually representing the brightness gradients which renders them 
more intuitively meaningful. 



7.2 Interpretation and Significance 

The top gradient maps in figures 7-2 and 7-5 show that horizontal gradients ($E_x$'s)
capture the vertical lines and features in the images. Similarly, the bottom gradient
maps in these figures demonstrate that vertical gradients ($E_y$'s) pick up the horizontal
lines and features in the images.

These experimental results show that the spatial gradients capture the geometric 
and shading characteristics of the images. It is important to notice that the compu- 
tation behind spatial gradients is very simple. However, they indirectly capture the 
edges, features, and boundaries of the scene. 

The temporal brightness gradient in fig. 7-3 tells us about the motion between 
two landscape images. First of all, the vertical lines and features are seen all over this 
temporal gradient map. This observation indicates that the motion has a horizontal 
translation component. 

Secondly, there are also horizontal lines in this gradient map but they become 
weaker as they get close to the left side of the map (this argument becomes more 
obvious if one compares the horizontal lines in here with those of E y in fig. 7-2). This 
means that motion has a rotational component which is centered in the left side of 
the image. In section 13.2, we will show that this is really the case. 

Also, we can observe that at any vertical stripe of the spatial gradient map,
the horizontal lines become stronger as their distance from the center of the stripe
increases. This observation indicates that the rotation center is located in the middle
of the image.

Figure 7-6 shows that the temporal brightness gradient map captures the vertical 
edges and features in the cup image sequence. The uniform strength of the vertical 
lines in fig. 7-6 is an indication of the fact that the motion in the cup image sequence 
is a pure horizontal translation. 

7.3 Summary 

The gradient maps and discussions presented in this chapter show that the spatial 
gradients capture the geometric and shading characteristics of the images and the 
temporal gradients contain important information about the motion. 

As shown in appendix B, the computational procedure behind gradient estimation 
is very simple. In fact, it only involves the subtraction of neighboring pixel values. 
Such a simple computation indirectly results in capturing the motion and detecting 
the features, edges, and boundaries in the images. 
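As an illustration of how little computation is involved, the sketch below estimates $E_x$, $E_y$ and $E_t$ from two frames. The formulation of appendix B is not reproduced in this chapter, so the code uses the familiar scheme of averaging first differences over a 2 x 2 x 2 cube of pixels as a stand-in assumption; the function name `brightness_gradients` is illustrative.

```python
import numpy as np

def brightness_gradients(E1, E2):
    """Estimate E_x, E_y, E_t from two successive frames.

    Uses averaged first differences over a 2 x 2 x 2 cube of pixels
    (an assumed estimator, standing in for the one in appendix B).
    Each output has one row and one column fewer than the input frames.
    """
    E1 = np.asarray(E1, dtype=float)
    E2 = np.asarray(E2, dtype=float)

    def dx(E):   # difference between horizontally neighboring pixels
        return E[:, 1:] - E[:, :-1]

    def dy(E):   # difference between vertically neighboring pixels
        return E[1:, :] - E[:-1, :]

    Ex = 0.25 * (dx(E1)[:-1] + dx(E1)[1:] + dx(E2)[:-1] + dx(E2)[1:])
    Ey = 0.25 * (dy(E1)[:, :-1] + dy(E1)[:, 1:] + dy(E2)[:, :-1] + dy(E2)[:, 1:])
    dt = E2 - E1  # difference between the two frames
    Et = 0.25 * (dt[:-1, :-1] + dt[:-1, 1:] + dt[1:, :-1] + dt[1:, 1:])
    return Ex, Ey, Et
```

Displaying the magnitude of each returned array as an image, with larger values shown brighter, gives gradient maps of the kind discussed in this chapter.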

However, we should emphasize that we neither intended to obtain such edges and 
features nor did we use such representation of the gradient maps in our algorithms. 
The intention was to demonstrate that the brightness gradient maps not only contain 
the motion information (which is usually represented by the optical flow maps) but 
also have a flavor of features and edges (used in edge maps and feature correspondence 
algorithms). 
Figure 7-1: The first and second frames in the landscape image sequence. The true
motion is a 0.3 deg rotation about the nominal optical axis Z, and a 2 mm translation
along the horizontal axis X.

Figure 7-2: The visual representation of the spatial brightness gradients for the
landscape image sequence in the horizontal direction (top) and vertical direction (bottom),
$E_x$ and $E_y$. The horizontal gradient map (top) has captured the vertical edges and
features in the image. Similarly, the vertical gradient map (bottom) has picked up
the horizontal edges and features.

Figure 7-3: The visual representation of the temporal brightness gradient for the
landscape image sequence, $E_t$. The vertical edges with relatively uniform strength
suggest that motion has a horizontal translation component. The horizontal edges
with decreasing strength towards left indicate that there is also a rotation centered
at the left of the image center.

Figure 7-4: The first and second images in the cup image sequence. The true motion
between these frames is a 3D translation of (2.5, 0, 4) mm.

Figure 7-5: The visual representation of the spatial brightness gradients for the cup
images in the horizontal direction (top) and vertical direction (bottom), $E_x$ and $E_y$.
The horizontal gradient map (top) has captured the vertical edges and features in the
image. Similarly, the vertical gradient map (bottom) has picked up the horizontal
edges and features.

Figure 7-6: The visual representation of the temporal brightness gradient for the cup
image sequence, $E_t$. The presence of relatively uniform vertical edges and features
here indicates that the motion is predominantly a horizontal translation.
The Effect of Fixation Patch Size



Chapter 8 



Finding the fixation velocity (the velocity at the fixation point) and the component
of rotational velocity about the fixation axis, ω_Ro, is the most important part of
our fixation-based method for recovering the shape and motion from an arbitrary
sequence of input images. This is because in our method a pixel shifting process
uses the fixation velocity to construct a sequence of fixated images from an arbitrary
sequence of input images (chapter 5). We also need ω_Ro for computing the total
rotational velocity (chapter 3).

In chapter 4 we introduced the algorithms for recovering the fixation velocity and
ω_Ro using the information from the fixation patch (an image patch around the fixation
point). In this chapter, we study the effect of the fixation patch size on the estimation
of the desired motion parameters using two different sequences of images where the
motion is a combination of translation and rotation.

8.1 Images with Moderate Relative Depth Changes 

Here, we have used a sequence of real images acquired at the Imaging Laboratory of
Carnegie Mellon University. Figure 7-1 shows two of these 576 x 384 pixel images.
The relative depth is moderate (1250 mm to 1625 mm, about 30% change) in the
image portion used in our computations. The camera has a nominal focal length of
24 mm and a pixel size of 0.02 x 0.02 mm. The calibrated principal point has been
used as the fixation point. The calibration technique is explained in section 13.1.
In a raster format system (origin at the top left corner of the image), the calibrated
principal point is located near the center of the image, at pixel (275, 205). The frontal
depth of this point is about 1450 mm.

The real motion between these two images has both translational and rotational
components. The real rotation is -0.3 deg about the optical axis Z and the real
translation is -2 mm along the horizontal axis X. Testing our algorithms using
such real images is valuable because the observed motion is relatively large (more
than subpixel motion in the image plane). For very large motions it is enough to
use higher frame grabbing rates. These days, there are commercially available frame
grabbers which are capable of capturing up to 7,500 frames per second at 12-bit gray
scale resolution on personal computers [82].

Using the algorithm described in chapter 4 we can find the horizontal and vertical
translations and the rotational component ω_Ro for any given fixation patch size. The
corresponding plots are shown in figures 8-1, 8-2 and 8-3. It is evident that these
estimates strongly depend on the fixation patch size, especially when the fixation
patch is small. Figure 8-1 shows that the horizontal translation converges to its real
value (-2 mm). On the other hand, the vertical translation (fig. 8-2) converges to
0.9 mm which is not its true value. The reason for this disparity is described in
section 13.2.

Figure 8-3 shows that for small patch sizes (less than 30 x 30 pixels in this case) the
estimated value for ω_Ro oscillates wildly and results in unacceptable values. As the
patch size increases, the estimated ω_Ro converges towards the real value of rotation.
For large patch sizes (around 100 x 100 pixels in this case) the estimated rotation,
-0.309 deg, becomes roughly the same as the real rotation, -0.3 deg.



Figure 8-1: The estimated value for the horizontal translation versus the fixation
patch size for the landscape image sequence. The true horizontal translation is
-2 mm.



8.2 Images with Significant Relative Depth Changes 



In this section we will study another image sequence (cup images) which has considerable
relative depth changes within the fixation patch (584 mm to 914 mm, about
60% difference). Figure 7-4 shows two of these 227 x 280 pixel images (cup images).

The real motion of the viewer is a horizontal translation of 2.5 mm to the right. 
The camera has a nominal focal length of 18.66 mm, pixel-width of 0.032 mm, and 
pixel-height of 0.029 mm. We have used the nominal principal point (image center) 
as our fixation point. 

Figure 8-4 shows the estimates for the horizontal translation, vertical translation,
and the rotational velocity component ω_Ro. It is obvious that the estimated values
depend strongly on the size of the fixation patch. We can find good estimates for 
these motion parameters if we use the right fixation patch size. 
Figure 8-2: The estimated value for the vertical translation versus the fixation patch
size for the landscape image sequence. The true vertical translation is zero, which is
apparently different from the experimental results (about -0.9 mm). In chapter 13,
we will show that this considerable difference is due to a calibration problem.

8.3 Finding a Good Estimate for ω_Ro Autonomously



It can be seen that the size of the fixation patch has a critical effect on the estimated
value of the component of rotational velocity about the fixation axis, ω_Ro. A small
patch size results in a value for ω_Ro which is usually far from the true value.
This is possibly because in a small patch, small translations can be interpreted as
large rotations. Figure 8-5 shows a hypothetical situation where (a) and (b) are
consecutive images of a small 3 x 3 pixel patch. The real motion in this case is most
likely a one-pixel vertical translation. But if we try to interpret it as a rotation about
the patch center we will end up with a 45 deg rotation, which is not acceptable
considering the assumed small motion between images.
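
The geometry behind this ambiguity is simple arithmetic. As a minimal sketch (the function name and the patch sizes other than 3 x 3 are illustrative assumptions, not values from the experiments), a tangential shift of one pixel seen at a distance of r pixels from the patch center looks like a rotation of atan(1/r) about that center:

```python
import math


def apparent_rotation_deg(half_width_px, shift_px=1.0):
    """Angle (in degrees) subtended at the patch center by a tangential
    shift of shift_px pixels observed half_width_px pixels away from it."""
    return math.degrees(math.atan2(shift_px, half_width_px))


if __name__ == "__main__":
    # Hypothetical patch sizes; only the 3 x 3 case appears in the text.
    for patch_size in (3, 31, 101):
        half_width = (patch_size - 1) / 2.0
        print(patch_size, round(apparent_rotation_deg(half_width), 2))
```

For the 3 x 3 patch the neighbors of the center are one pixel away, which gives the 45 deg of the text; the same one-pixel shift at the border of a much larger patch corresponds to a far smaller angle, in line with the observation that large patches give more plausible rotation estimates.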

In conclusion, we can autonomously find a good estimate for the rotational
velocity component ω_Ro simply by using a relatively large fixation patch size.



Figure 8-3: The estimated value of the component of rotational velocity about the
fixation axis, ω_Ro, versus the fixation patch size for the landscape image sequence. For
large patch sizes, the estimated value of ω_Ro (about -0.309 deg) converges towards
the real value of ω_Ro, -0.3 deg.



8.4 Updating the Fixation Velocity Using ω_Ro



In the previous section, we saw that a good estimate for ω_Ro can be found using a
relatively large patch but the corresponding fixation velocity estimate from such a
large patch is usually not reliable. This observation suggests that we may be able to
obtain better estimates for the fixation velocity components if we use the estimated
value of ω_Ro and recompute the fixation velocity.

Figure 8-4: The estimated values for the horizontal and vertical translations and the
rotational component ω_Ro versus the fixation patch size for the cup image sequence.
The true motion is a horizontal translation of 2.5 mm.

Using only the estimate for ω_Ro from a large patch, we can compute the total
motion field at any point (x, y) on a small patch around the fixation point (fixation
patch). As we showed in chapter 4,

$$
x_t = u_o + \frac{\omega_{R_o}}{\sqrt{x_o^2 + y_o^2 + 1}}\,(y - y_o), \qquad
y_t = v_o - \frac{\omega_{R_o}}{\sqrt{x_o^2 + y_o^2 + 1}}\,(x - x_o)
\tag{8.1}
$$



where (x_o, y_o) is the position of the fixation point (located in the image plane), and
(u_o, v_o) is the fixation velocity that we are about to estimate. After substituting x_t
and y_t into the BCCE, eqn. (2.7), we will have



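$$
\left[ u_o + \frac{\omega_{R_o}}{\sqrt{x_o^2 + y_o^2 + 1}}\,(y - y_o) \right] E_x
+ \left[ v_o - \frac{\omega_{R_o}}{\sqrt{x_o^2 + y_o^2 + 1}}\,(x - x_o) \right] E_y
+ E_t = 0.
\tag{8.2}
$$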



However, due to noise, the above equation does not necessarily hold at every pixel. As
a result, we can find u_o and v_o by minimizing the sum of the errors over the whole
fixation patch, namely by minimizing the sum of the squared left-hand side of
eqn. (8.2) over the patch with respect to u_o and v_o. This results in a system of two
linear equations that can be solved for the two unknowns u_o and v_o. Note that ω_Ro
has already been computed and is a known value in these equations.

Figure 8-5: Using a small fixation patch can result in the wrong interpretation of a large
rotation. In a patch of 3 x 3 pixels, a one-pixel vertical translation can be interpreted
as a 45 deg rotation, which is not an acceptable answer at all considering the finite
motion between images.
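
The least-squares update of this section can be sketched in a few lines. This is only an illustration under stated assumptions: the function name and the sample layout `(x, y, E_x, E_y, E_t)` are hypothetical, image coordinates are taken in units of the focal length (so that the factor sqrt(x_o^2 + y_o^2 + 1) applies directly), and the integrals over the patch are replaced by sums over pixels. With the known rotational part moved into c = k (y - y_o) E_x - k (x - x_o) E_y + E_t, where k = ω_Ro / sqrt(x_o^2 + y_o^2 + 1), eqn. (8.2) reads u_o E_x + v_o E_y + c = 0, and the normal equations form a 2 x 2 linear system:

```python
import math


def update_fixation_velocity(samples, x0, y0, omega):
    """Least-squares estimate of (u_o, v_o) for a known omega.

    samples: iterable of (x, y, Ex, Ey, Et) for the pixels of the patch.
    x0, y0:  fixation point; omega: rotation about the fixation axis.
    """
    k = omega / math.sqrt(x0 * x0 + y0 * y0 + 1.0)
    sxx = sxy = syy = bx = by = 0.0
    for x, y, ex, ey, et in samples:
        # Known part of the residual: rotational motion field plus E_t.
        c = k * (y - y0) * ex - k * (x - x0) * ey + et
        sxx += ex * ex
        sxy += ex * ey
        syy += ey * ey
        bx -= ex * c
        by -= ey * c
    det = sxx * syy - sxy * sxy
    if abs(det) <= 1e-12 * sxx * syy:
        raise ValueError("patch gradients are degenerate; motion not recoverable")
    # Cramer's rule for [[sxx, sxy], [sxy, syy]] [u, v]^T = [bx, by]^T.
    u0 = (bx * syy - by * sxy) / det
    v0 = (by * sxx - bx * sxy) / det
    return u0, v0
```

The degeneracy check is a design choice of this sketch: a patch whose gradients all point in one direction makes the system singular, which is the situation chapter 10 guards against when choosing the fixation point.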
8.4.1 Improved estimations

Here, we use the updated algorithms (which take advantage of a good ω_Ro estimate)
to find estimates for the translational components of the fixation velocity.

Figures 8-6 and 8-7 compare the updated and previous estimations of the horizontal
and vertical translations in the landscape images. These figures show that there
are some improvements in the updated estimations, especially for the vertical translation
(fig. 8-7). The improvements in the updated estimations are more pronounced
in the plots corresponding to the cup images (figures 8-8 and 8-9). Note that we have
better improvements where there is the most need for them, namely in the cup images,
where the relative depth variation is large compared to the landscape images.

Figure 8-6: The updated and previous estimations of the horizontal translation, along
the X-axis, versus the fixation patch size for the landscape image sequence.

Despite improvements, the dependency of the updated translational components
on the fixation patch size is still quite clear in these figures. However, we can find good
estimates for these motion parameters if we choose the right fixation patch size. In
practice, we do not know the real fixation velocity, and therefore we cannot select an
appropriate fixation patch size by checking the computed values of the translational
components. The next chapter introduces a technique for autonomous choice of an
optimum fixation patch size.

Figure 8-7: The updated and previous estimations of the vertical translation, along
the Y-axis, versus the fixation patch size for the landscape image sequence.
Figure 8-8: The updated and previous estimations of the horizontal translation, along
the X-axis, versus the fixation patch size for the cup image sequence.

Figure 8-9: The updated and previous estimations of the vertical translation, along
the Y-axis, versus the fixation patch size for the cup image sequence.

The experimental results and explanations in the previous chapter suggest that
relatively large patch sizes should be used in order to get a good estimate for the
component of the rotation about the fixation axis, ω_Ro. On the other hand, we know
that in general using a very large patch size will result in a wrong estimate for the
fixation velocity because depth variations usually increase as the patch size increases.

Figures 8-1 and 8-4 showed that for any image sequence, there is an optimum 
patch size which results in good estimates for the fixation velocity components. The 
corresponding optimum patch size is about 100 x 100 pixels for the landscape image 
sequence (fig. 8-1) and about 50 x 50 pixels for the cup image sequence (fig. 8-4). 

In this chapter, we will describe an autonomous technique for finding the optimum 
fixation patch size which results in good estimates for the fixation velocity components 
for any image sequence. 

9.1 Normalized Error 

We showed that for any given size of the fixation patch, we can find the fixation velocity
components, u_o and v_o. Also, the component of the rotational velocity about the
fixation axis, ω_Ro, can be estimated reliably using a relatively large patch. Knowing
these values, the motion field velocity (x_t, y_t) at any point (x, y) in the image plane is
given by eqn. 8.1. Ideally, for any given image point (x, y) the BCCE, eqn. 2.7, must
be satisfied. However, in practice we are dealing with real images which are noisy and
as a result, the term x_t E_x + y_t E_y + E_t does not usually become zero. This term can
be considered as an error term for the corresponding pixel. In a patch of size p x p
pixels, we can add the squares of these error terms to define the normalized error, e, as

$$
e = \frac{\sum \left[ x_t E_x + y_t E_y + E_t \right]^2}{p^2}
\tag{9.1}
$$

This definition allows us to compare the performance of different patch sizes by study- 
ing the behavior of the normalized error e with respect to the changes in the patch 
size p. 
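
As a rough sketch of how this quantity could be evaluated (the function names and the sample layout `(x, y, E_x, E_y, E_t)` are illustrative assumptions, with coordinates in focal-length units as in eqn. 8.1), one computes the BCCE residual at every pixel of the patch and divides the sum of squares by p^2:

```python
import math


def bcce_residual(x, y, ex, ey, et, u0, v0, omega, x0, y0):
    """BCCE error term x_t*E_x + y_t*E_y + E_t, with (x_t, y_t) from eqn. 8.1."""
    k = omega / math.sqrt(x0 * x0 + y0 * y0 + 1.0)
    xt = u0 + k * (y - y0)
    yt = v0 - k * (x - x0)
    return xt * ex + yt * ey + et


def normalized_error(samples, p, u0, v0, omega, x0=0.0, y0=0.0):
    """Sum of squared BCCE residuals over a p x p patch, divided by p^2."""
    total = 0.0
    for x, y, ex, ey, et in samples:
        r = bcce_residual(x, y, ex, ey, et, u0, v0, omega, x0, y0)
        total += r * r
    return total / float(p * p)
```

Dividing by p^2, the number of pixels in the patch, makes values obtained for different patch sizes comparable, which is what the following sections rely on.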



9.2 Optimum Patch Size 

In this section, we show how the normalized error can be used for finding an optimum 
patch size which results in good estimates for the components of the fixation velocity. 
Any patch of a real image may include a substantial depth range. In general, there are 
two main groups of images. In the first group, there are moderate changes in depth 
variation as the patch size increases. The second group represents images where the 
depth variation increases significantly as the patch size increases. 

9.2.1 Moderate changes in relative depth 

Figure 9-1 shows the normalized error versus the fixation patch size for the landscape 
image sequence. Although this plot corresponds to a specific image and motion, it 
shows one of the two typical representations of the normalized error behavior as the 
patch size increases. As shown in this figure, the normalized error first increases with 
the patch size, reaches a peak and then dips down. 

This is because initially for the smallest patch size (3 x 3 pixels) the algorithm
finds the motion estimates that make the BCCE error term (x_t E_x + y_t E_y + E_t) as
small as possible. The algorithm does a good job in minimizing the total of 9 error
terms in this small patch but the motion estimates are usually very bad at this level
because basically there are not enough data available to the algorithm.

Figure 9-1: The estimated value of the normalized error e versus the fixation patch size
for the landscape image sequence. The optimum patch size occurs around 100 pixels.

In the next level, we have a patch of 5 x 5 pixels size which provides more data. 
While there is still not enough data for the algorithm to come up with good motion 
estimates, it finds parameters which minimize the sum of the BCCE error terms. 
However, the algorithm is not usually as successful as it was for the 3x3 pixels patch 
size because it has to deal with more error terms and this results in higher normalized 
error. 

As we increase the patch size, the struggle between providing more data to the 
algorithm and satisfying more error terms continues. For relatively small patch sizes, 
this results in higher normalized error. The normalized error increases until it reaches 
a peak where the role of extra input data becomes more important than satisfying 
more error terms. Then by increasing the patch size, we are providing more data 
to the algorithm and this gives a better motion estimate and results in a smaller
normalized error. 

After dipping down, the normalized error stays roughly the same in this case,
because the relative depth variation does not change much with the patch size
(fig. 9-1). The optimum patch size in this example occurs around 100 x 100 pixels, which
corresponds to the start of the small slope in normalized error, a roughly flat portion
after the first peak. In this example, relative depth changes are moderate (1250 mm
to 1625 mm, about 30% difference) and stay roughly the same as the patch size
increases.

9.2.2 Significant changes in relative depth 

The normalized error for the cup image sequence is shown in fig. 9-2. As before, the
normalized error first increases and after reaching a peak it dips down and then grows
with the patch size again. This is because in the beginning, insufficient information
results in extremely wrong estimates and this causes the normalized error to increase
with the patch size. As we are providing more and more data to the algorithm, we
obtain better estimates for the motion components and this decreases the normalized
error. If we increase the patch size beyond an optimum size, which occurs at about
50 pixels in this example, the normalized error starts increasing again. In this 50 x 50
pixel patch, we have a considerable amount of relative depth change (from 584 mm
to 914 mm, about 57% increase). Such significant relative depth variation leads
to larger errors in the fixation velocity estimates which in turn results in a larger
normalized error as p grows.

Figure 9-2: The normalized error versus fixation patch size for the cup image sequence.
The optimum patch size occurs around 50 pixels.



9.3 Autonomous Choice of Optimum Patch Size 

As one might expect, the optimum fixation patch size depends on the patch topology 
and texture which may vary from image to image. However, the general pattern of 
the normalized error allows us to autonomously find an optimum fixation patch size 
which gives good estimates for the fixation velocity components. 

In the case where considerable changes in the relative depth occur with patch size 
increase, as in the cup image sequence, the optimum fixation patch size corresponds 
to the minimum normalized error that occurs after the peak value of the normalized 
error. And in cases where the relative depth does not change significantly with patch 
size, as in the landscape image sequence, the optimum fixation patch size is where 
the normalized error does not change considerably as the patch size increases. 

A human operator may not have much problem identifying the optimum patch 
size on the normalized error plots. But our aim is to come up with a simple algorithm 
which allows a machine to autonomously find the optimum patch size from any given 
normalized error data set. 
9.3.1 Algorithm

This section describes the algorithm for obtaining the optimum fixation patch size 
from any normalized error data set. The general algorithm is composed of the fol- 
lowing steps: 

• Step 1: Setting the patch size bounds 

All the experimental results unanimously show that the motion estimates from a 
small patch are not reliable at all. As a result, we can put a lower bound on the patch 
size. By taking into account the camera parameters and the image size, we have used 
a 15 x 15 pixel patch as the lower bound of the patch size. Moreover, the square
shape of the patch, the location of the fixation point, and the image size dictate an
upper bound on the patch size. As a result, we have used 140 x 140 pixels as the
upper bound in our experiments. 

• Step 2: Computing the normalized error slope 

Denoting the normalized error at patch i as e[i], we define the slope at patch i as

$$
S[i] = \frac{e[i+1] - e[i]}{e[i]}
\tag{9.2}
$$

The slope S[i] is dimensionless and shows the relative change of the normalized error 
as the patch i changes to patch i + 1. 

• Step 3: Setting a slope index 

By searching through the slope space, we can find the steepest (most negative) 
slope and denote it as Smax. This definition allows the algorithm to get a sense of 
steepness (or flatness) at any point on the normalized error curve. We define the slope 
index Sind as a small percentage (about 15%) of the steepest slope Smax. A study
of many normalized error plots has shown that this choice of Sind allows us to
identify relatively flat portions in a typical normalized error curve.

• Step 4: Searching for the optimum patch size 

We choose the lower bound patch size as the first candidate for the optimum size. 
Then, we move to the next patch size and select it as the new nominated optimum
patch size if it satisfies the following two conditions: 

- First condition: Its normalized error e[i] should be less than the normalized error 
value of the previously nominated optimum point. 

- Second condition: Its corresponding slope S[i] should be steeper (more negative) 
than the slope index, Sind. 

We continue this search process until we reach the upper bound of the patch size.

• Step 5: Locating the optimum patch size

After checking all the data, the point immediately after the last nominated point 
is selected as the optimum point. 
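
The five steps above can be sketched as follows. This is a minimal reading of the description, not the original implementation: the function name, the argument layout, and the handling of the last data point (which has no slope) are assumptions, and the normalized errors are assumed to be strictly positive so that eqn. 9.2 is defined.

```python
def optimum_patch_size(sizes, errors, lower=15, upper=140, index_fraction=0.15):
    """Pick the optimum patch size from a normalized error curve.

    sizes:  patch sizes in ascending order (pixels).
    errors: normalized error e for each patch size.
    """
    # Step 1: keep only the patch sizes inside the bounds.
    kept = [(s, e) for s, e in zip(sizes, errors) if lower <= s <= upper]
    if len(kept) < 2:
        raise ValueError("need at least two patch sizes within the bounds")
    size = [s for s, _ in kept]
    err = [e for _, e in kept]

    # Step 2: relative slope S[i] = (e[i+1] - e[i]) / e[i], eqn. 9.2.
    slopes = [(err[i + 1] - err[i]) / err[i] for i in range(len(err) - 1)]

    # Step 3: slope index as a fraction of the steepest (most negative) slope.
    s_max = min(slopes)
    s_ind = index_fraction * s_max

    # Step 4: nominate points that lower the error while still descending steeply.
    nominated = 0
    for i in range(1, len(slopes)):
        if err[i] < err[nominated] and slopes[i] < s_ind:
            nominated = i

    # Step 5: the point right after the last nominated one is the optimum.
    return size[nominated + 1]
```

With a curve that falls steeply and then flattens, the search stops nominating once the slope becomes shallower than Sind, so the returned size marks the start of the flat portion; with a curve that falls and then rises again, it returns the minimum after the peak.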

9.3.2 Experimental results 

The above algorithm has been applied to the normalized error data set of the land- 
scape and the cup image sequences (figures 9-1 and 9-2) to obtain the optimum patch 
sizes. The corresponding experimental results of locating the optimum patch size are 
shown in figures 9-3 and 9-4. In these figures, the nominated optimum points are 
shown by small circles on the normalized error curves. It can be seen that for both 
cases the algorithm finds the optimum points correctly. 

Figure 9-3: Searching process of finding the optimum patch size for the landscape
image sequence. The nominated points are shown by small circles. The last point
represents the optimum point which occurs at 101 pixels in this case.

Figure 9-3 shows that the optimum patch size for the landscape image sequence is
selected at 101 pixels which corresponds to a small field of view (about 2 x 2.4 deg).
If we go back to figures 8-6 and 8-7 again, we see that one of the best estimations
for the translational components occurs at this optimum patch size (101 pixels). The
optimum patch size for the cup image sequence is selected at 47 pixels (fig. 9-2).
Similarly, figures 8-8 and 8-9 show that we obtain one of the best combined motion
estimates at this optimum point (47 pixels). This optimum patch size for the cup image
sequence gives approximately the same field of view as the one for the landscape
image sequence (about 2 x 2.4 deg). This is an important observation considering
that we have obtained roughly the same optimum field of view for two totally different
images, cameras, and focal lengths.
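
The field-of-view figures can be checked from the camera data given in chapter 8. The sketch below assumes that "2 x 2.4 deg" denotes a full angle of twice a 2.4 deg half-angle, and it uses the nominal focal lengths and pixel sizes quoted earlier; the function name is illustrative.

```python
import math


def half_angle_deg(patch_px, pixel_mm, focal_mm):
    """Half field of view (degrees) of a patch patch_px pixels wide."""
    half_width_mm = 0.5 * patch_px * pixel_mm
    return math.degrees(math.atan(half_width_mm / focal_mm))


if __name__ == "__main__":
    # Landscape: 101 pixels, 0.02 mm pixels, 24 mm focal length.
    print(round(half_angle_deg(101, 0.02, 24.0), 2))
    # Cup: 47 pixels, 0.032 mm (width) and 0.029 mm (height), 18.66 mm focal length.
    print(round(half_angle_deg(47, 0.032, 18.66), 2))
    print(round(half_angle_deg(47, 0.029, 18.66), 2))
```

Under these assumptions the half-angle comes out at roughly 2.4 deg for the landscape patch and between about 2.1 and 2.3 deg for the cup patch, which is consistent with the statement that both optimum patches cover approximately the same field of view.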



9.3.3 Further results 

In order to test our algorithm further, we have run it on many other image sequences 
with smaller and larger motions. The algorithm has worked successfully in finding 
the optimum patch sizes in all cases. Some of the corresponding experimental results 
are shown in figures 9-5, 9-6, 9-7, and 9-8. These experimental results for the other
image sequences show that the corresponding optimum patch sizes are close to, but not
necessarily the same as, the values we obtained before. However, in every case the
obtained optimum point represents the patch size which results in one of the best 
motion estimates. 
Figure 9-5: Searching process of finding the optimum patch size for the landscape,20-30
image sequence. The motion is two times as large as before (-4 mm translation
and -0.6 deg rotation). The nominated points are shown by small circles. The last
point represents the optimum point which occurs at 101 pixels in this case.

Figure 9-8: Searching process of finding the optimum patch size for the cup25 image
sequence. The motion is a -2.5 mm translation. The nominated points are shown by
small circles. The last point represents the optimum point which occurs at 49 pixels
in this case.

Autonomous Choice of an
Chapter 10



In general, our fixation algorithms do not place any restrictions on the choice 
of the fixation point location and virtually any point can be chosen as the fixation 
point. Among all points, the choice of principal point (image center) makes the 
formulations simpler. However, in practice, one should take some more considerations 
into account while choosing an appropriate fixation point. Most significantly, the 
motion of the chosen fixation point should be detectable using the information from 
its corresponding patch. To clarify this, we can consider a patch which has a uniform 
brightness. Choosing the center of such a patch as the fixation point will not be 
useful, because the motion of such a point is irrecoverable using only the information 
from that patch. This chapter introduces a technique for autonomous choice of an 
appropriate fixation point. 

10.1 Algorithm 

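Similar to chapter 4 (when using ω_Ro = 0), the least squares method can be applied
to the BCCE terms to obtain the following system of linear equations for the uniform
motion field (u, v) on a patch P:

$$
\begin{pmatrix}
\iint_P E_x^2 \, dx\, dy & \iint_P E_x E_y \, dx\, dy \\
\iint_P E_x E_y \, dx\, dy & \iint_P E_y^2 \, dx\, dy
\end{pmatrix}
\begin{pmatrix} u \\ v \end{pmatrix}
=
\begin{pmatrix}
-\iint_P E_t E_x \, dx\, dy \\
-\iint_P E_t E_y \, dx\, dy
\end{pmatrix}
\tag{10.1}
$$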
It is obvious that the solution for (u, v) exists (i.e. the motion is detectable) if the
determinant of the above matrix,

$$
D = \left( \iint_P E_x^2 \, dx\, dy \right) \left( \iint_P E_y^2 \, dx\, dy \right)
  - \left( \iint_P E_x E_y \, dx\, dy \right)^2
\tag{10.2}
$$

is not zero. However, this is not a reliable criterion for real images because, due to noise,
we may have D ≠ 0 without any guarantee that the patch is an appropriate one.
If we denote the smaller eigenvalue of the coefficient matrix in eqn. 10.1 by λ_s,

$$
\lambda_s = \frac{1}{2} \left( \iint_P (E_x^2 + E_y^2) \, dx\, dy
  - \sqrt{ \left( \iint_P (E_x^2 - E_y^2) \, dx\, dy \right)^2
         + 4 \left( \iint_P E_x E_y \, dx\, dy \right)^2 } \right)
\tag{10.3}
$$

then we can define a good fixation point as a point whose corresponding patch has
the largest λ_s. Using such a patch not only guarantees a solution (D ≠ 0) but also
ensures that our solution (u, v) is not sensitive to noise errors in the coefficient matrix
of eqn. 10.1.

The reasoning behind using the largest λ_s is the form of the characteristic polynomial
of the coefficient matrix in eqn. 10.1,

$$
F(\lambda) = \lambda^2
  - \left( \iint_P (E_x^2 + E_y^2) \, dx\, dy \right) \lambda
  + \left( \iint_P E_x^2 \, dx\, dy \right) \left( \iint_P E_y^2 \, dx\, dy \right)
  - \left( \iint_P E_x E_y \, dx\, dy \right)^2
\tag{10.4}
$$

When λ is large, small errors in the coefficients result in a negligible error in F(λ)
compared to the case when λ is small. This implies that in patches with larger
λ_s, the apparent motion components (u, v) are less sensitive to small errors in the
coefficients which may occur due to noise.
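
A discrete version of eqn. 10.3 is short enough to write out. In this sketch the integrals over the patch are replaced by sums over its pixels, and the function name and the input layout (a list of `(E_x, E_y)` pairs) are illustrative assumptions:

```python
import math


def smaller_eigenvalue(gradients):
    """Smaller eigenvalue of [[sum Ex^2, sum ExEy], [sum ExEy, sum Ey^2]].

    gradients: iterable of (Ex, Ey) pairs, one per pixel of the patch.
    """
    sxx = sxy = syy = 0.0
    for ex, ey in gradients:
        sxx += ex * ex
        sxy += ex * ey
        syy += ey * ey
    root = math.sqrt((sxx - syy) ** 2 + 4.0 * sxy * sxy)
    return 0.5 * ((sxx + syy) - root)
```

A patch of uniform brightness gives λ_s = 0, and so does a patch whose gradients all point in the same direction (a single straight edge), even though its gradients are not zero. Only patches with brightness variation in more than one direction score well.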



10.2 Discussion 

It is easy to implement the λ_s criterion for autonomous choice of a good fixation point.
This criterion results in reliable choices for the fixation point even in real noisy images.
For patches with relatively uniform brightness λ_s is small, which means that we
should avoid choosing the fixation point in such a patch. We will get larger and larger
values of λ_s as we choose patches with more features and brightness variations.

We have addressed the question of finding an appropriate fixation point (the center 
of a fixation patch) among a number of given patches. But which patches should we 
check in the first place? We can search the whole image for a globally optimum 
location of a fixation point in the following steps: 

• Step 1: Divide the whole image into 4 quadrants and find the corresponding λ_s
for each quadrant.

• Step 2: Use the quadrant with the largest λ_s as a new base image.

• Step 3: Repeat steps 1 & 2 until reaching a quadrant with an acceptable size.

However, performing such a comprehensive search may not always be necessary.
Instead, we can check a limited number of neighboring patches (near the principal
point, for convenience) and choose the center of the one with the largest λ_s as the
fixation point.
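
The quadrant search of steps 1 to 3 can be sketched as a simple loop. Everything specific here is an assumption made for illustration: the gradient maps are given as lists of rows, `min_size` stands in for the "acceptable size", and the function names are hypothetical.

```python
import math


def lambda_s(ex, ey, r0, r1, c0, c1):
    """Smaller eigenvalue (eqn. 10.3, sums instead of integrals) over the
    region of rows r0..r1-1 and columns c0..c1-1."""
    sxx = sxy = syy = 0.0
    for r in range(r0, r1):
        for c in range(c0, c1):
            gx, gy = ex[r][c], ey[r][c]
            sxx += gx * gx
            sxy += gx * gy
            syy += gy * gy
    return 0.5 * (sxx + syy - math.sqrt((sxx - syy) ** 2 + 4.0 * sxy * sxy))


def find_fixation_point(ex, ey, min_size=16):
    """Repeatedly keep the quadrant with the largest lambda_s and return the
    (row, column) of the center of the final quadrant."""
    r0, r1, c0, c1 = 0, len(ex), 0, len(ex[0])
    while (r1 - r0) > min_size and (c1 - c0) > min_size:
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
        quadrants = [(r0, rm, c0, cm), (r0, rm, cm, c1),
                     (rm, r1, c0, cm), (rm, r1, cm, c1)]
        r0, r1, c0, c1 = max(quadrants, key=lambda q: lambda_s(ex, ey, *q))
    return (r0 + r1) // 2, (c0 + c1) // 2
```

Because the four quadrants compared at each level have (nearly) the same area, their unnormalized λ_s values can be compared directly. The search is greedy, so it may miss a small well-textured region inside a quadrant that looks poor as a whole.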
Chapter 11: Tracking without Moving the Camera



The fixation method requires a sequence of fixated (tracked) images as its input. 
However, in general the acquired image sequences may not be fixated at any point 
and even if they are it is not easy to find that fixation point. 

Our fixation method does not depend on how the fixated images are obtained. 
But along the course of this thesis work, we were confronted with the challenge of 
constructing a sequence of fixated (tracked) images from an arbitrary image sequence. 

This chapter describes the experimental results and the implementation issues involved
in constructing sequences of fixated images from several real image sequences.

11.1 Background 

The task of constructing a sequence of fixated images is, in essence, the well known 
tracking problem. People have been working on different aspects of this problem using 
various techniques for many years [43, 22, 53]. For example, Aloimonos & Tsakiris 
[5] propose a method for tracking a foveated target of known shape; Bandopadhay et 
al. [10] use optical flow and feature correspondence for tracking the principal point 
in order to find the motion in a special case (they assume that there is no rotation 
along the optical axis) without considering noise; and Sandini & Tistarelli [52] use 
an optical flow based tracking method for finding the depth in a special case (no 
rotation along the optical axis). All these methods use optical flow and/or feature 
correspondence and address only special cases. There has also been some work on
using visual tracking for finding the trajectory of an object moving in an environment 
[15, 90]. 

Traditionally, tracking has been associated with mechanically moving the camera 
to keep the image of a particular point stationary at the image center. Some techniques
even rely on such a system. For example, Thompson [74] introduces an optical
flow method for recovering the motion in the special case where the rotational velocity
along the optical axis is zero. His method requires a sequence of tracked images at the 
principal point but he acknowledges that the actual implementation of such tracking 
requirement in engineering systems is not possible yet. 

Hardware tracking is done by physically moving the camera with respect to the 
environment. Considering that in general the point of interest has a motion relative 
to the observer, the 2nd fixated image cannot be obtained in one step. As a result,
a feedback control loop is required for the camera rotation system to compensate for
the errors resulting from the new position of the fixation point [46, 20, 24, 37, 89, 
19]. These difficulties and other problems such as expense, real time response, and 
potential errors involved make mechanical tracking unattractive especially for our 
vision system. 



11.2 Pixel Shifting Process 

Here, we use the pixel shifting process described in chapter 5 for constructing a sequence
of fixated images from an arbitrary image sequence. This method solves the
tracking problem in its most challenging case. In other words, it does not require
any knowledge about the motion or shape. Furthermore, the fixation point is not
restricted to the principal point (image center) and virtually any point can be chosen
as the fixation point. The pixel shifting process is done purely in software without any
need to mechanically move the camera for tracking. It is computationally simple and
uses neither optical flow nor feature correspondence. Instead, brightness gradients of
the initial input images are used directly.

11.2.1 Bilinear Interpolation 

We showed that constructing a fixated image is the same as finding the brightness E
for any pixel (x, y) of such an image (see chapter 5). We proved that the brightness
E at pixel (x,y) of the 2nd fixated image is the same as the brightness at the pixel 
(x - Tu, y - Tv) of the 2nd initial image where the shifting vector (u, v) is given by 
eqn. 5.4 and T is the time interval between two initial images. 

In practice, the point (x - Tu, y - Tv) does not exactly coincide with any pixel.
Instead it is usually surrounded by four pixels whose brightnesses may be denoted by
$E_{i,j}$, $E_{i,j+1}$, $E_{i+1,j}$, and $E_{i+1,j+1}$, fig. 11-1. In this figure, p and q are the horizontal
and vertical distances of the mapped point from pixel (i,j). Considering that this
can happen for any pixel, the average $\frac{1}{4}(E_{i,j} + E_{i,j+1} + E_{i+1,j} + E_{i+1,j+1})$ is not a good
estimate for E because it corrupts the constructed image by introducing aliasing.
Bilinear interpolation of the surrounding brightness levels has proven to be a very
good estimate for E, which is given as

$$E = (1 - p)(1 - q)\,E_{i,j} + p(1 - q)\,E_{i,j+1} + q(1 - p)\,E_{i+1,j} + pq\,E_{i+1,j+1}. \tag{11.1}$$

Figure 11-1: The mapped point in the 2nd initial image does not usually coincide
with any single pixel. Instead it is usually surrounded by four pixels.

As shown in fig. 11-1, p and q represent the horizontal and vertical distance of the 
mapped point from pixel (i,j). Such an algorithm gives the largest weight to the 
pixel closest to the mapped point and results in the exact brightness value when it 
coincides with any pixel, p = q = 0. 
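
As an illustration only, eqn. 11.1 can be sketched in a few lines of Python (the thesis implementation was written in C and is not reproduced here). The function name `bilinear_sample`, the list-of-rows image layout, and the choice to return `None` for points outside the image are assumptions made for this sketch.

```python
import math


def bilinear_sample(image, row, col):
    """Brightness at a non-integer (row, col) using eqn. 11.1.

    `image` is a list of rows of brightness values. Returns None when the
    point lies outside the image (hypothetical convention for this sketch).
    """
    n_rows, n_cols = len(image), len(image[0])
    if not (0 <= row <= n_rows - 1 and 0 <= col <= n_cols - 1):
        return None
    i, j = int(math.floor(row)), int(math.floor(col))
    q = row - i  # vertical distance from pixel (i, j)
    p = col - j  # horizontal distance from pixel (i, j)
    # Clamping only matters on the last row/column, where q or p is zero.
    i1 = min(i + 1, n_rows - 1)
    j1 = min(j + 1, n_cols - 1)
    return ((1 - p) * (1 - q) * image[i][j]
            + p * (1 - q) * image[i][j1]
            + q * (1 - p) * image[i1][j]
            + p * q * image[i1][j1])
```

With p = q = 0 the first weight is one and the others vanish, so the sketch returns the pixel brightness itself, as the text requires.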

All the constructed images in this work are obtained using bilinear interpolation. 
Our experimental results have shown that such interpolation is quite satisfactory. 
There are other techniques, such as bicubic interpolation [1, 13, 32, 49, 50], which
are much more expensive; we did not find it necessary to use them in this
work.

11.3 Construction of Fixated Images 

The landscape and cup image sequences in figures 7-1 and 7-4 are used as input 
(initial) images in the following experiments. As we discussed earlier, the 1st initial 
images (top images) in those figures are directly used as the 1st fixated images. Then 
the pixel shifting process and the bilinear interpolation are applied to the 2nd initial 
images (bottom images in figures 7-1 and 7-4) to construct the 2nd fixated images, 
figures 11-2 and 11-3. These constructed images are quite good and look as natural 
and crisp as the original images do. We will describe the quality of these images 
further in the following sections. 

Depending on the size and direction of the equivalent rotational velocity $\Omega$ (see
chapter 5), the brightness E at some border pixels is not computable because they
are mapped to points outside the domain of the initial images. The brightness at such
border pixels is given an arbitrary value, which causes the appearance of
bold black lines at the border of the constructed images. This should not concern us
because in general the results near the image borders are not considered reliable
anyway.

Figure 11-2: The constructed, 2nd fixated, image for the landscape image sequence.
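
The construction of a 2nd fixated image, including the border handling described above, might be sketched as follows. This is a minimal Python illustration, not the thesis code: `construct_fixated_image`, the `shift` callable (a stand-in for the shifting vector of eqn. 5.4, which is not reproduced here), and the fill value of 0 for unmapped border pixels are all assumptions.

```python
import math


def _bilinear(image, row, col):
    """Bilinear interpolation (eqn. 11.1); the caller guarantees the point is inside."""
    i, j = int(math.floor(row)), int(math.floor(col))
    q, p = row - i, col - j
    i1 = min(i + 1, len(image) - 1)
    j1 = min(j + 1, len(image[0]) - 1)
    return ((1 - p) * (1 - q) * image[i][j] + p * (1 - q) * image[i][j1]
            + q * (1 - p) * image[i1][j] + p * q * image[i1][j1])


def construct_fixated_image(second_image, shift, T=1.0, fill=0.0):
    """Build the 2nd fixated image from the 2nd initial image.

    shift(x, y) returns the shifting vector (u, v) at pixel (x, y); here it is
    a hypothetical callable supplied by the user. Pixels that map outside the
    initial image receive `fill` (assumed 0, which renders as black).
    """
    n_rows, n_cols = len(second_image), len(second_image[0])
    fixated = []
    for y in range(n_rows):
        out_row = []
        for x in range(n_cols):
            u, v = shift(x, y)
            src_x, src_y = x - T * u, y - T * v
            if 0 <= src_x <= n_cols - 1 and 0 <= src_y <= n_rows - 1:
                out_row.append(_bilinear(second_image, src_y, src_x))
            else:
                out_row.append(fill)
        fixated.append(out_row)
    return fixated
```

For a constant shift of one pixel to the right, the first column cannot be computed and takes the fill value, which is the source of the dark border lines.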



11.4 Spatial and Temporal Gradient Maps 



The gradient maps are good measures for studying the quality and characteristics of 
fixated image sequences. This section examines the gradient maps of two different 
fixated image sequences that we have constructed from real image sequences. 
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Figure 11-3: The constructed, 2nd fixated image, for the cup image sequence. 

11.4.1 Landscape fixated image sequence 

The combination of the 1st initial image (top image in fig. 7-1) and the 2nd fixated
image in fig. 11-2 forms the landscape fixated image sequence. The corresponding spatial
gradient maps in fig. 11-4 show that these gradients contain valuable information.
The vertical and horizontal features of the initial images are indirectly represented in
the spatial gradients.

The temporal gradient map of the landscape fixated image sequence is shown in
fig. 11-5. This map contains very important information. First, it clearly shows
the characteristic of a fixated image sequence: both the horizontal
and vertical features of the image sequence become more obvious as their distance
from the fixation point (the image center in this case) increases. Secondly, the
appearance of the horizontal and vertical lines here provides hints about the existence
of a rotational component about the fixation axis. Finally, the dominant vertical
lines are an indication that the equivalent rotational velocity has a major component
about the vertical axis.

11.4.2 Cup fixated image sequence 

The fixated cup image sequence consists of the top image in fig. 7-4 (as the 1st 
fixated image) and the 2nd fixated image in fig. 11-3. Figure 11-6 shows the spatial 
gradient maps for this image sequence. The horizontal gradient map (top) identifies 
the vertical edge-like features and the vertical gradient map (bottom) detects the 
horizontal edge-like features in the image. We should emphasize here that we neither 
intended to find edges nor have we used those. However, it is important to observe 
that spatial gradients (simple horizontal and vertical differences) of fixated images 
indirectly capture important features of the images. 

Figure 11-7 represents the temporal gradient map of the fixated cup image sequence.
This map is dominated by vertical lines, which indicate that the rotational
component about the fixation axis is negligible and the equivalent rotational velocity
has only a component about the vertical axis. Furthermore, these vertical lines
become more evident as their distance from the image center increases, which is an
indication that the fixation point is located near the image center.
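
The gradient maps discussed in this section are built from simple brightness differences. The following Python sketch shows one common way to estimate them; it averages first differences over a 2 x 2 x 2 cube of brightness values in the style of Horn and Schunck. That particular averaging scheme is an assumption here (the text states only that simple differences and a 2 x 2 mask are used), and the name `gradients` is hypothetical.

```python
def gradients(first, second, i, j):
    """Estimate (E_x, E_y, E_t) at pixel (i, j) of a two-frame sequence.

    `first` and `second` are lists of rows; x runs along columns (index j)
    and y along rows (index i). Each derivative is the mean of four first
    differences taken over the 2 x 2 x 2 cube anchored at (i, j).
    """
    a, b = first, second
    e_x = 0.25 * ((a[i][j + 1] - a[i][j]) + (a[i + 1][j + 1] - a[i + 1][j])
                  + (b[i][j + 1] - b[i][j]) + (b[i + 1][j + 1] - b[i + 1][j]))
    e_y = 0.25 * ((a[i + 1][j] - a[i][j]) + (a[i + 1][j + 1] - a[i][j + 1])
                  + (b[i + 1][j] - b[i][j]) + (b[i + 1][j + 1] - b[i][j + 1]))
    e_t = 0.25 * ((b[i][j] - a[i][j]) + (b[i][j + 1] - a[i][j + 1])
                  + (b[i + 1][j] - a[i + 1][j]) + (b[i + 1][j + 1] - a[i + 1][j + 1]))
    return e_x, e_y, e_t
```

Evaluating the function at every pixel (except the last row and column) and displaying each component as an image gives maps of the kind shown in figures 11-4 through 11-7.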

11.5 Summary 

The experimental results in this chapter show that the pixel shifting process can be 
easily used for constructing a sequence of images fixated at any arbitrary point. This 
software based technique is computationally simple and does not require moving the 
camera for tracking the desired fixation point. 

The novel representation of the spatio-temporal gradients by their corresponding 
maps showed that gradients not only preserve the image features but also capture the 
motion in a unique way which reflects the characteristics of fixated image sequences.

Figure 11-4: The spatial gradient maps of the fixated landscape image sequence in
x direction (top) and y direction (bottom).

Figure 11-5: The temporal gradient map of the fixated landscape image sequence.

Figure 11-6: The spatial gradient maps of the fixated cup image sequence in
x direction (top) and y direction (bottom).

Figure 11-7: The temporal gradient map for the fixated cup image sequence.

Chapter 12

Depth Map Recovery



This chapter describes how depth maps are recovered from real image sequences. 
It also describes implementation issues and the techniques used in the recovery of 
depth maps. 

12.1 Introduction 

Earlier in chapter 3, we proved that ideally the depth at any point of a fixated image
is given by eqn. 3.35,

$$Z = \frac{\mathbf{s}\cdot\mathbf{t}}{\dfrac{(\mathbf{v}\times\hat{\mathbf{R}}_o)\cdot\mathbf{t}}{\|\mathbf{R}_o\|} - E_t - \omega_R\,(\mathbf{v}\cdot\hat{\mathbf{R}}_o)} \tag{12.1}$$

where $\hat{\mathbf{R}}_o$ is the unit vector along the fixation axis and s and v are the known vector
functions of pixel position (x,y) and spatial gradients ($E_x$, $E_y$) as given in equations
2.9 and 2.10.

The translational velocity t is obtained by finding the eigenvector corresponding 
to the smallest eigenvalue of matrix M in eqn. 3.31. The optimal patch size found in 
chapter 9 is used for the estimation of t. 
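
Finding the eigenvector that belongs to the smallest eigenvalue can be illustrated without a linear algebra library. The Python sketch below assumes only that M is real and symmetric (the entries of M come from eqn. 3.31 and are not reproduced here); it applies power iteration to cI - M, where c is a Gershgorin bound on the largest eigenvalue, so that the smallest eigenvalue of M becomes the dominant one. The function name, the starting vector, and the iteration count are choices made for this sketch, and a production implementation would normally call a library eigensolver instead.

```python
import math


def smallest_eigenvector(M, iterations=500):
    """Unit eigenvector for the smallest eigenvalue of a symmetric matrix M.

    Power iteration on A = c*I - M, with c >= largest eigenvalue of M, so the
    dominant eigenvector of A is the one sought. Convergence is slow when the
    two smallest eigenvalues are close, and fails if the start vector happens
    to be orthogonal to the answer.
    """
    n = len(M)
    c = max(sum(abs(x) for x in row) for row in M)  # Gershgorin bound
    A = [[(c if i == j else 0.0) - M[i][j] for j in range(n)] for i in range(n)]
    v = [1.0 / (k + 1) for k in range(n)]           # arbitrary, non-symmetric start
    for _ in range(iterations):
        w = [sum(A[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v
```

The sign of the returned vector is arbitrary, which corresponds to the usual sign ambiguity of an eigenvector.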

All the computations in this chapter are performed using the data from the fixated
image sequences that we constructed in chapter 11.

12.2 Detecting the Depth Flaws

It is well known that depth recovery from real images is not perfect because of noise 
and other characteristics of real images. This section describes the techniques for 
detecting pixels where depths are not acceptable. 
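Using the notations Num and Denom as,

$$\mathrm{Num} = (\mathbf{s}\cdot\mathbf{t})(\mathbf{s}\cdot\mathbf{t}) \tag{12.2}$$

and

$$\mathrm{Denom} = \left(\frac{(\mathbf{v}\times\hat{\mathbf{R}}_o)\cdot\mathbf{t}}{\|\mathbf{R}_o\|} - E_t - \omega_R\,(\mathbf{v}\cdot\hat{\mathbf{R}}_o)\right)(\mathbf{s}\cdot\mathbf{t}), \tag{12.3}$$

equation 12.1 can be written as,

$$Z = \frac{\mathrm{Num}}{\mathrm{Denom}}. \tag{12.4}$$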



Using this equation, we can compute depth Z at any single pixel in the image. How- 
ever, the recovered depth is not always reliable. We call a depth Z unacceptable if it 
satisfies any of the following cases. 

• Case 1: Denom is negative.

This condition results in a negative depth, which should not happen in our vision
system. This usually happens where the data is noisy.

• Case 2: Denom is zero.

This case results in an irrecoverable depth ($Z = \frac{0}{0}$) or a wrong depth ($Z = \infty$).
It may occur for many reasons, such as zero translational velocity, a pixel lying
in a patch with uniform brightness (zero gradients), or apparent
motion in a direction perpendicular to the spatial gradients.
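
The two cases above translate directly into a per-pixel test. The Python sketch below assumes that Num and Denom have already been evaluated at each pixel; the names `classify_depth` and `depth_flaw_map`, and the tolerance `eps` used to decide when Denom counts as zero, are assumptions of this illustration.

```python
def classify_depth(num, denom, eps=1e-12):
    """Return (Z, None) for an acceptable depth, or (None, reason) for a flaw."""
    if abs(denom) <= eps:
        return None, "case 2: Denom is zero"
    if denom < 0:
        return None, "case 1: Denom is negative"
    return num / denom, None


def depth_flaw_map(num_map, denom_map, eps=1e-12):
    """True marks a pixel whose depth is unacceptable (a black point in the flaw map)."""
    return [[classify_depth(n, d, eps)[0] is None for n, d in zip(n_row, d_row)]
            for n_row, d_row in zip(num_map, denom_map)]
```

Because Num is a square and therefore never negative, the sign of Z is decided by Denom alone, which is why the test inspects only the denominator.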

Figure 12-1 shows the depth flaw map for the fixated cup image sequence obtained
by using the above criteria for detecting the points with unacceptable depth. Any
black point in this map represents a pixel whose computed depth is not acceptable.
It is quite obvious from this figure that if we compute the depth using only the data
from a single pixel, then we will end up with a considerable number of pixels where
depths are not acceptable.

Figure 12-1: The flaws in the depth map for the fixated cup image sequence. The
pixels with unacceptable depth are shown in black.



12.3 Constructing a Primary Depth Map 

Figure 12-2 shows the depth map where each depth value is computed using only the
data from its corresponding pixel. Using such a method leaves us with many pixels of
unacceptable depth, which are left blank (white) in this depth map.

This is a primary depth map and obviously is not very informative because depth
information is missing in many areas. In the next section the first effort is made at
estimating the depth at such points.

Figure 12-2: The initial depth map for the fixated cup image sequence. The areas
close to the viewer are bright and the pixels whose depths are not acceptable are left
blank (white).



12.3.1 Filling in the Missing Depths 



At any pixel where the depth information is missing (depth is unacceptable), we can
find a depth estimate by averaging the reliable depths at its surrounding pixels. The
notation $r_f$ is used for the radius of such a patch. This radius is defined in a way
that forms a square patch whose side has a length of ($2 \times r_f + 1$) pixels. Figure 12-3
shows the corresponding completed depth map. A maximum patch size of radius
$r_f$ = 6 pixels has been used for finding an estimate for the points where depths were
not known in the initial depth map, fig. 12-2. Although this primary depth map is not
perfect, it delivers very useful clues about the boundaries of objects in the environment
(books, cup, and spoon).
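
A possible form of this filling step is sketched below in Python. It assumes that missing depths are marked with `None` and that the patch radius is grown one pixel at a time until at least one reliable depth falls inside the patch, up to a maximum radius; the growth strategy, the function name `fill_missing_depths`, and the use of only originally reliable depths in each average are assumptions of this sketch rather than details given in the text.

```python
def fill_missing_depths(depth, max_radius):
    """Fill None entries with the mean of reliable depths in a square patch.

    The patch around (i, j) has side 2*r + 1 and r grows from 1 to max_radius.
    Returns the completed map and the largest radius that was actually needed.
    """
    n_rows, n_cols = len(depth), len(depth[0])
    filled = [row[:] for row in depth]
    largest = 0
    for i in range(n_rows):
        for j in range(n_cols):
            if depth[i][j] is not None:
                continue
            for r in range(1, max_radius + 1):
                values = [depth[a][b]
                          for a in range(max(0, i - r), min(n_rows, i + r + 1))
                          for b in range(max(0, j - r), min(n_cols, j + r + 1))
                          if depth[a][b] is not None]
                if values:
                    filled[i][j] = sum(values) / len(values)
                    largest = max(largest, r)
                    break
    return filled, largest
```

A pixel with no reliable depth within `max_radius` stays `None`, so the caller can see where the map could not be completed.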



Figure 12-3: The completed depth map for the fixated cup image sequence with
$r_f$ = 6 pixels. The areas close to the viewer look brighter.

12.4 Improving the Depth Map 



We can considerably improve the depth map by using the data from a surrounding
patch for computing the depth at any pixel point. We denote the radius of such a patch
by $r_p$. Similar to $r_f$, the radius $r_p$ is defined in a way to form a square patch whose
side has a length of ($2 \times r_p + 1$) pixels.

Applying such a simple technique decreases the number of depth flaws and increases
the quality of the depth map considerably. Figure 12-4 shows the results when a
patch of 1 pixel in radius is used for depth computation at any pixel ($r_p$ = 1 pixel).
Although the depth flaws (at the top of fig. 12-4) have not disappeared, they have
shrunk noticeably when compared to the previous case.

The initial depth map is shown in the middle of fig. 12-4 where the pixels with
unreliable depth estimates are left blank (white). The completed depth map is given
at the bottom of fig. 12-4 where a patch of maximum 9 pixels in radius ($r_f$ = 9 pixels)
is used for finding depth estimates at points where depths were not known in the initial
depth map. The shapes of the objects in the image have started to become identifiable
in this completed depth map.
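
The text does not spell out how the data in a patch are combined, so the following Python sketch shows only one plausible rule: sum Num and Denom over the patch and take the ratio of the sums. If Num = (s · t)^2 and Denom = D (s · t) for a per-pixel factor D, this ratio is the least-squares estimate of Z obtained by fitting D = (s · t)/Z over the patch. The pooling rule, the name `patch_depth`, and the tolerance `eps` are all assumptions.

```python
def patch_depth(num_map, denom_map, i, j, r_p, eps=1e-12):
    """Depth at (i, j) from a square patch of radius r_p (side 2*r_p + 1).

    Sums Num and Denom over the part of the patch inside the image and
    returns their ratio, or None when the pooled Denom is zero or negative.
    """
    n_rows, n_cols = len(num_map), len(num_map[0])
    total_num = 0.0
    total_denom = 0.0
    for a in range(max(0, i - r_p), min(n_rows, i + r_p + 1)):
        for b in range(max(0, j - r_p), min(n_cols, j + r_p + 1)):
            total_num += num_map[a][b]
            total_denom += denom_map[a][b]
    if total_denom <= eps:
        return None
    return total_num / total_denom
```

With `r_p = 0` the function reduces to the single-pixel computation, so a noisy pixel that fails on its own can still receive a depth once its neighbours are pooled.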

12.5 Even Better Depth Maps 

The depth maps can be further improved by using larger patches for depth estimation
at any single pixel. Figures 12-5 through 12-7 show the depth flaw, initial depth, and
completed depth maps for cases with patch sizes of radius $r_p$ = 2, 3, and 4 pixels. The
maximum patch radii for completing the depth map were $r_f$ = 11, 15,
and 17 pixels respectively. These maps show that the environment objects (books,
spoon, cup, and even the background poster) become more identifiable and smoother.
The experimental results show that if a relatively large initial patch size $r_p$ is used
then the depth map may lose some of its fine details.

12.6 Subsampling the Fixated Images 

In this section, we have subsampled each of the fixated images by a factor of 2 before
using them for depth recovery. This is done by substituting a patch of 2 x 2 neighboring
pixels with a new pixel whose brightness is the average of the 4 initial pixels. This is the
smallest symmetric subsampling which can be done on an image. We expect to gain
a better depth map because subsampling usually leads to a decrease in noise.
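
The 2 x 2 averaging described above can be written compactly. In this Python sketch the function name `subsample_by_two` is hypothetical, and dropping a trailing odd row or column is an assumption, since the text does not say how odd image sizes are handled.

```python
def subsample_by_two(image):
    """Replace each 2 x 2 block of pixels by one pixel holding the block average.

    `image` is a list of rows. A trailing odd row or column is dropped.
    """
    n_rows, n_cols = len(image) // 2, len(image[0]) // 2
    return [[(image[2 * i][2 * j] + image[2 * i][2 * j + 1]
              + image[2 * i + 1][2 * j] + image[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(n_cols)]
            for i in range(n_rows)]
```

Averaging four pixels reduces uncorrelated noise in each output pixel, which is the motivation given in the text for subsampling before depth recovery.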

The depth flaw (top), initial depth (middle), and completed depth (bottom) maps
for the subsampled image sequence with $r_p$ = 0 are shown in fig. 12-8. These maps
indicate that some improvements are made by subsampling. This becomes clear if
we notice that in the depth flaw map (top of fig. 12-8) there are fewer regions with
unacceptable depths than in the corresponding depth flaw map obtained from images
which were not subsampled (fig. 12-1). The initial depth map (middle) is not very
informative here. As before, the pixels with unacceptable depths are left blank (white)
in the initial depth map. A patch of maximum 4 pixels in radius ($r_f$ = 4 pixels) is
used for completing the initial depth map. Even this completed depth map (bottom
of fig. 12-8) offers only a very vague intuition about the boundaries of the objects in
the image.

In the next step, we have used a patch of 1 pixel in radius ($r_p$ = 1) for the depth
estimation at any single pixel. The results are shown in fig. 12-9. As expected, the
depth flaws have not fully disappeared (top). These points are left blank (white) in
the initial depth map (middle). For obtaining the completed depth map (bottom), a
patch of maximum 6 pixels in radius ($r_f$ = 6) is used in this case. Considering the
subsampling size of 2 x 2 pixels, these results are located somewhere between the
results of nonsampled images with $r_p$ = 2 and $r_p$ = 3 (figures 12-5 and 12-6).

Figure 12-10 shows the results for the subsampled images for the case with $r_p$ =
2 pixels and $r_f$ = 9 pixels.

A careful observation shows that there are not many differences between sampled 
and nonsampled results from the point of view of identifying different objects in the 
environment. However, the depth maps of subsampled images have much better 
quality and are relatively free from the systematic noise. This is quite clear if we 
notice that the vertical black lines between the books which were seen in previous 
depth maps are absent here. These lines represent narrow but deep vertical gaps 
between the books which did not actually exist in the environment. 

Furthermore, due to the grey level limitation of the printer, quality depth maps cannot
be printed out. The computed depth maps are much better than what is shown here.
For example, each book has its own relatively uniform depth which clearly distinguishes it
from its neighboring books when there is a depth change in the real environment.



12.7 Summary 

This chapter combined the individual results that we had obtained in previous chapters
and used them in the recovery of depth maps. The recovered depth maps are
quite good considering that the input to the system was only two unrestricted frames.
These images were real and noisy. Furthermore, the motion was not known in advance,
and the recovered motion was used in the computations. It is also important
to notice that simple computations have been involved in all the steps.

The experimental results show that by subsampling the initial images, much better 
depth maps are obtained. This is due to the fact that subsampling acts as a low pass 
filter and eliminates the high frequency noise which is inherent in real images. 

An overall study of the experimental results in this chapter shows that
$r_p$ = 2 or 3 pixels seems to be a good choice for computing depth maps. This is probably
because a mask of 2 x 2 pixels is used for the computation of gradients.
As a result, using a smaller $r_p$ will not give a good depth map. On the other hand,
using a larger $r_p$ may result in the elimination of some fine details of the depth map
and does not improve the overall quality of the depth map.

It should also be pointed out that we do not have any control over choosing $r_f$.
The algorithm automatically chooses an $r_f$ large enough to include pixels with reliable
depths in order to find estimates for depths at pixels where depths were missing in
the initial depth map.

All the results in this chapter were constructed by using a single $r_f$ for obtaining
a depth estimate at any pixel point with an unacceptable depth value. An adaptive
approach which chooses $r_p$ appropriately at any desired pixel point will result in
smoother depth maps.



Figure 12-4: The depth flaw (top), initial depth (middle), and completed depth (bottom)
maps for the fixated cup image sequence with $r_p$ = 1 pixel, and $r_f$ = 9 pixels.
The areas close to the viewer look brighter.

Figure 12-5: The depth flaw (top), initial depth (middle), and completed depth (bottom)
maps for the fixated cup image sequence with $r_p$ = 2 pixels, and $r_f$ = 11 pixels.
The areas close to the viewer are shown brighter.

Figure 12-6: The depth flaw (top), initial depth (middle), and completed depth (bottom)
maps for the fixated cup image sequence with $r_p$ = 3 pixels, and $r_f$ = 15 pixels.
The areas close to the viewer look brighter.

Figure 12-7: The depth flaw (top), initial depth (middle), and completed depth (bottom)
maps for the fixated cup image sequence with $r_p$ = 4 pixels, and $r_f$ = 17 pixels.
The areas close to the viewer look brighter.

Figure 12-8: The depth flaw (top), initial depth (middle), and completed depth (bottom)
maps for the subsampled (by 2) fixated cup image sequence with $r_p$ = 0, and
$r_f$ = 4 pixels. The areas close to the viewer look brighter.

Figure 12-9: The depth flaw (top), initial depth (middle), and completed depth (bottom)
maps for the subsampled (by 2) fixated cup image sequence with $r_p$ = 1, and
$r_f$ = 6 pixels. The areas close to the viewer look brighter.

Figure 12-10: The depth flaw (top), initial depth (middle), and completed depth
(bottom) maps for the subsampled (by 2) fixated cup image sequence with $r_p$ = 2,
and $r_f$ = 9 pixels. The areas close to the viewer look brighter.

Chapter 13

Calibration Issues



Camera calibration is an important area of research involving the study of tech- 
niques for obtaining reliable estimates for the required internal and external param- 
eters of a camera in a vision system. 

For many years, computer vision scientists have been working on different aspects 
of camera calibration problems such as focal length (principal distance) [77, 86, 87], 
principal point (image center) [33, 86], scale factor (difference between the scanning 
frequency of the camera sensor plane and the scanning frequency of the image cap- 
turing board frame buffer) [33, 47], intrinsic parameters (camera internal geomet- 
ric and optical characteristics) [77], extrinsic parameters (the 3D position and ori- 
entation of the camera coordinate relative to a certain world coordinate system) 
[77, 85, 87, 18, 16, 86], and the hand-eye transform system (the 3D position and ori- 
entation of a camera relative to the last joint of a robot manipulator in an eye-on-hand 
configuration) [78, 79, 12]. 

In the previous chapters we saw that some parameters such as focal length and
principal point play an important role in the formulations. Manufacturers usually give
a nominal value for the focal length but this nominal value is not always sufficiently
accurate to be used in the computations. Some other important parameters such as
the true principal point are not given at all.

In this chapter, some of the calibration techniques used in this work will be described.

13.1 Principal Point Calibration

The principal point is where the optical axis intersects the image plane; see fig. 2-1. 
Ideally, the principal point is located at the center of the image plane. However, in 
off-the-shelf cameras the principal point is not necessarily located at the center of the 
image plane. Finding the true location of the principal point is important because 
those values appear in our algorithms. 

For the cup images the nominal image center was used as the principal point 
because the camera was not accessible to be calibrated. On the other hand, in the 
case of the landscape images the true principal point was obtained using a direct 
optical method [33]. 

The experimental results showed that the true principal point was considerably 
off from the nominal image center. It was located at about 13 pixels to the left and 
13 pixels below the nominal image center. 

13.1.1 Direct optical method 

The direct optical method is a very simple and accurate calibration technique for 
finding the principal point. This method requires only a laser. The lens assembly is 
used as a reflecting surface and therefore, the lens can remain mounted on the camera. 

When a laser beam is pointed at a lens assembly, part of the light is reflected 
when the beam enters the glass and also when it leaves it. Multiple reflections occur 
when the beam is reflected within the lens and can be observed on a piece of paper 
attached to the front of the laser with a small hole for the primary beam. With some 
experimental skill the laser can be adjusted relative to the lens so that all reflections 
coincide with the primary beam, indicating that it is aligned with the optical axis. 
Once aligned, an attenuation filter is placed in the optical path, the camera is turned 
on and the center of the light spot observed can be used as the image center. 

This method is commonly used in experimental optics to align lens assemblies and
gives reproducible results. If the lens is removed, the reflection from the surface of
the image sensor will also give an indication of its perpendicularity with respect to
the optical axis. When a low power laser (< 10 mW) is used, no harm is done to a
discrete array camera sensor (CCD). However, vidicon tubes might be damaged by
burning in.

13.2 Calibration of the Rotation Axis 

In the landscape experiments, we did not explicitly apply any vertical translation
(along the Y axis). However, fig. 8-2 shows a considerable vertical translation of about
-0.9 mm. This is mainly because the real rotation axis does not pass through the
center of projection (see footnote 1).

To clarify this, we should mention that in motion vision, it is assumed that the 
rotation axis passes through the origin of the viewer centered coordinate system, i.e 
the center of projection. But at the CMU Imaging Laboratory, the rotation mechanism 
was not set up to align the Z axis of rotation with the optical axis. The CMU vision 
system was equipped with several cameras and evidently the camera used for taking 
the landscape images was set off center. However, for obtaining the experimental 
results, we have employed algorithms which erroneously assume that the rotation 
axis passes through the center of projection. 

Footnote 1: If the CCD edges are not accurately aligned with the horizontal and vertical axes
of the camera frame, i.e. the CCD is mounted at an angle with respect to the camera coordinate
system, errors of this kind occur in both the vertical and horizontal directions. That is not the
case here because the inaccuracy of motion estimation has occurred only in the vertical direction.

According to basic kinematics, the compensating translation which results
from shifting the rotation axis is given by

$$\mathbf{V}_o = -\boldsymbol{\omega}\times\mathbf{B} \tag{13.1}$$

where B is a vector extending from a point on the real (desired) rotation axis to a point
on the assumed rotation axis; see fig. 13-1.

Figure 13-1: In motion vision the assumption is that the rotation axis passes through
the center of projection (origin). In the landscape image sequence, the true rotation axis
is parallel to the optical axis but does not pass through the origin. This will result in
a translation which should be compensated for.

In our experiment, $\mathbf{V}_o = -(\omega\hat{\mathbf{z}})\times(b\hat{\mathbf{x}})$
where $\mathbf{V}_o = -0.9\,\hat{\mathbf{y}}$ mm, and $\omega = -0.3$ degree. As a result, the real rotation axis was
located at about $b = -(-0.9)/((-0.3 \times \pi)/180) = -172$ mm perpendicular distance
from the optical axis in the horizontal plane.
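
The arithmetic behind this estimate can be checked with a short Python sketch. Expanding the cross product in the experiment gives a purely vertical translation of size -ωb, so b = -V_y/ω with ω in radians. The function name `rotation_axis_offset` is hypothetical.

```python
import math


def rotation_axis_offset(v_y_mm, omega_deg):
    """Horizontal offset b (mm) of the real rotation axis from the optical axis.

    From V = -(omega z) x (b x) = -omega * b * y, the vertical component is
    V_y = -omega * b, hence b = -V_y / omega (omega converted to radians).
    """
    return -v_y_mm / math.radians(omega_deg)


b = rotation_axis_offset(-0.9, -0.3)  # about -171.9 mm, i.e. roughly -172 mm
```

The result agrees with the value of about -172 mm quoted in the text.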



13.2.1 Generalization 

A similar method can be used for the calibration of the rotation axis which is parallel 
to the optical axis in a camera system arrangement in the general case. 

In order to find the real location of the rotation axis, the following steps should
be taken:

• Step 1: Apply a pure rotation about the axis which is supposed to be the optical
axis.

• Step 2: If rotation $\omega_R$ is not accurately known, compute it by applying eqn. 4.4
to a relatively large patch around the principal point.

• Step 3: Estimate the apparent motion ($u_o$, $v_o$) at the principal point using
eqn. 8.4 or 4.4.

• Step 4: The real location of the rotation axis is given by,

$$b_x = -\frac{v_o Z_o}{f\,\omega_R}, \qquad b_y = +\frac{u_o Z_o}{f\,\omega_R} \tag{13.2}$$

where $Z_o$ is the depth at the principal point, and f is the focal length of the camera.

Point ($b_x$, $b_y$) represents the location where the real rotation axis (which is parallel
to the optical axis) intersects the image plane.

13.3 Summary 

Focal length, principal point, and the rotation axis position are the three most important
factors which can affect the computations in our motion vision algorithms.

The experimental results show that we may be able to get away with using the 
nominal focal length as the focal length, and using the image center as the principal 
point. However, we have to calibrate the system for finding the real rotation axis and 
compensate for the resultant translation if the rotation axis does not pass through 
the projection center. The calibration technique introduced in this chapter offers an 
easy and reliable solution to this important problem.

Chapter 14

Conclusions

This thesis introduced a general motion vision system which takes any sequence
of images as its input and recovers the motion and shape without any need to check,
choose, and adjust parameters. A complete implementation of this motion vision
system has been tested on real images and the critical issues involved in its
autonomous implementation have been studied. This chapter makes some concluding
remarks about this fixation based motion vision system.

14.1 Features 

• In contrast to previous work done in the area of motion vision, our solutions are 
general and do not impose any severe restrictions on the motion or the structure of 
the environment. 

• The fixation method uses neither optical flow nor feature correspondence. In- 
stead, it directly employs the image brightness gradients. 

• Our motion vision system neither requires tracked images as input nor uses 
hardware tracking for obtaining fixated images. Instead, it introduces a pixel shifting 
process for constructing fixated image sequences at any arbitrary fixation point. This 
process is done entirely in software without moving the camera for tracking. 

• The fixation method does not restrict the fixation point and virtually any point 
can be chosen as the fixation point. 

• The algorithms and formulations presented in the fixation method are simple
and have been successfully implemented on real images.



14.2 Results 

• Good estimations for motion parameters can be obtained using optimum patch sizes 
(see chapter 8). 

• The novel introduction and use of normalized error has enabled us to find opti- 
mum patch sizes which result in good estimates for motion parameters. This technique 
has been implemented on many real image sequences (see chapter 9). 

• The novel pixel shifting process for constructing fixated (tracked) images has 
been successfully tested on several real image sequences (see chapter 11). 

• The experimental results in chapter 12 show that good depth maps can be
obtained using only two monocular real images. If we use the data from a single pixel
for recovering the corresponding depth, the reliable depth map will be sparse. Using
the information from several pixels in a surrounding patch for finding the depth at
its central point results in a relatively dense map of reliable depths. We can obtain
even better results by subsampling the initial images. Subsampling acts as a low pass
filter and overcomes some of the inherent high frequency noise in real images.

• We may get away with using the nominal focal length and principal point in the 
fixation formulations, but we have to make sure to calibrate the imaging system for 
the real rotation axis. The method described in chapter 13 offers a simple solution to 
this important practical problem which can result in considerable motion estimation 
errors if it is not detected and compensated for. 

• The implementations were done on a Sun SPARCstation IPX using C code.
Despite not using either parallel or optimized programs, the actual run-time for finding
the motion parameters and the depth map for an image of 227 x 280 pixels was
a fraction of a second and a few seconds, respectively.

14.3 Assumptions

• In solving the general motion vision problem and writing eqn. 3.27, 
we assumed that the motion parameters can be obtained using a small patch around 
the fixation point. This is a purely geometric assumption and does not place any 
restrictions on the depth topology. Numerous experimental results in chapter 9 show 
that the optimum patch sizes are small enough to justify our assumption. 

• This work assumes that there is one rigid motion between the environment 
and the observer. However, small deviations from rigidity are tolerated by the system 
because they are treated as noise, and the least squares method finds the solution 
that best fits the whole data. 

14.4 Shortcomings 

• The fixation method fails if the fixation point is located at the center of a uniform 
brightness patch because in such a case, motion will be undetectable. However, we 
have presented a mechanism for preventing this from happening by introducing an 
autonomous technique which chooses an appropriate location for the fixation point 
(see chapter 10). 

14.5 Relation to Other Works 

• As opposed to other work done in the area of direct methods, our fixation technique 
estimates both the motion and shape for the general case [69, 60]. 

• In recent years, many Kalman filter based techniques have tried to improve the 
depth estimates over time by using more than two frames [38, 39, 40, 56, 57, 58, 59, 
25]. These techniques not only need to know the motion in advance but also require 
a good initial guess for the depth map in order to converge to a solution. Despite 
these major advantages of the Kalman filter methods, the depth maps recovered by our 
fixation method are far superior to those obtained by the Kalman 
filtering methods even after several iterations [26, 27]. 

• Recently, Tomasi and Kanade [76, 75] introduced a feature based technique for 
recovering the motion and shape from a sequence of images. Their method is different 
from our work in the following sense: 

- It assumes orthographic projection, which handicaps the system when dealing 
with nearby objects. 

- It uses feature correspondence. 

- It requires choosing and tracking many feature points. 

- Depth is obtained only at the feature points. 

- It is computationally very expensive. 



14.6 Future Extensions 

• The motion estimates obtained from the fixation method are quite satisfactory. However, 
the depth maps may be improved by using more than two image frames in a Kalman 
filter based system as follows: 

- Converting the input images to a sequence of fixated images at a desired fixation 
point using the pixel shifting process. 

- Obtaining the motion estimates from the fixation method if the motion is not known. 

- Using the depth map estimates from the fixation method as the initial guess for 
the Kalman filter system. 

Employing such a hybrid system can potentially improve the depth map and 
accelerate the convergence rate of the Kalman filter. 

• The algorithms and formulations in the fixation method are very well suited 
to parallel implementation. Such an approach would greatly improve the system 
performance because most of the operations are simple additions and subtractions 
that are performed independently at every point of the image. 

• Due to their parallel nature, the fixation algorithms can be implemented on a 
single chip using analog VLSI techniques such as those of Mead [41]. This seems 
to be [...] 

• By using segmentation, this work can be extended to the multiple-motion case. 

Appendix A: Derivation of Brightness Change Constraint Equation 



The brightness change constraint equation (BCCE) relates the change in the image 
brightness at a point (x,y) to the apparent velocity (u,v) of the brightness pattern 
at that point in the image. This appendix describes in detail the steps involved in 
the derivation of the BCCE [30, 54, 29]. 

Let $E(x,y,t)$ denote the image brightness at time $t$ at the image point $(x,y)$. 
Then, if $u(x,y)$ and $v(x,y)$ are the $x$ and $y$ components of the apparent velocity at 
the point, we expect that the brightness will be the same at time $t + \delta t$ at the point 
$(x + \delta x,\, y + \delta y)$, where $\delta x = u\,\delta t$ and $\delta y = v\,\delta t$. In other words, 

$$E(x, y, t) = E(x + u\,\delta t,\; y + v\,\delta t,\; t + \delta t) \tag{A.1}$$

for a small time interval $\delta t$. The underlying assumption in writing eqn. A.1 is that 
the spatio-temporal variations in lighting are slow, which is true for many practical applications. 

If brightness varies smoothly with $x$, $y$, and $t$, we can expand the right hand side 
of the above equation in a Taylor series to obtain 

$$E(x, y, t) = E(x, y, t) + \delta x\,\frac{\partial E}{\partial x} + \delta y\,\frac{\partial E}{\partial y} + \delta t\,\frac{\partial E}{\partial t} + e \tag{A.2}$$

where $e$ includes second- and higher-order terms in $\delta x$, $\delta y$, and $\delta t$. Canceling $E(x, y, t)$, 
dividing through by $\delta t$, and taking the limit as $\delta t \to 0$, we obtain 

$$\frac{dx}{dt}\,\frac{\partial E}{\partial x} + \frac{dy}{dt}\,\frac{\partial E}{\partial y} + \frac{\partial E}{\partial t} = 0 \tag{A.3}$$

which is actually just the expansion of the total derivative of $E$ with respect to time 
into its partial derivatives, in other words 

$$\frac{dE}{dt} = 0. \tag{A.4}$$

Using the abbreviations 

$$x_t = \frac{dx}{dt}, \qquad y_t = \frac{dy}{dt} \tag{A.5}$$

and 

$$E_x = \frac{\partial E}{\partial x}, \qquad E_y = \frac{\partial E}{\partial y}, \qquad E_t = \frac{\partial E}{\partial t} \tag{A.6}$$

equation A.3 can be written as 

$$E_t + x_t E_x + y_t E_y = 0. \tag{A.7}$$

The above equation is called the brightness change constraint equation because it 
expresses a constraint on the components $x_t$ and $y_t$ of the apparent velocity at a point 
$(x,y)$ in the image. 

In appendix B, we will show how the derivatives $E_x$, $E_y$, and $E_t$ are estimated at 
any image point. 
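
As a numerical illustration of the brightness change constraint equation, the following minimal Python sketch evaluates E_t + x_t E_x + y_t E_y for a smooth brightness pattern that translates rigidly in the image. The pattern, the velocity values, and the function names are hypothetical and chosen only for illustration; the derivatives here are taken by central differences rather than by the estimator of appendix B.

```python
import math

def brightness(x, y, t, u, v):
    """Hypothetical smooth pattern translating with image velocity (u, v)."""
    return math.sin(0.3 * (x - u * t)) + math.cos(0.2 * (y - v * t))

def bcce_residual(x, y, t, u=0.5, v=-0.25, h=1e-4):
    """Return E_t + u*E_x + v*E_y using central differences of step h."""
    def E(a, b, c):
        return brightness(a, b, c, u, v)
    E_x = (E(x + h, y, t) - E(x - h, y, t)) / (2 * h)
    E_y = (E(x, y + h, t) - E(x, y - h, t)) / (2 * h)
    E_t = (E(x, y, t + h) - E(x, y, t - h)) / (2 * h)
    return E_t + u * E_x + v * E_y

# The residual should be close to zero, up to finite-difference error.
residual = bcce_residual(3.0, 7.0, 1.5)
```

Because the pattern only translates, its brightness is conserved along the motion, so the residual vanishes up to the truncation error of the differences.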



Appendix B: Computation of Brightness Gradients 



The spatial and temporal derivatives of the image brightnesses are the basic data 
blocks in the direct methods. This appendix describes the formulations behind the 
estimation of the brightness gradients in images [30, 29]. 

The spatial brightness gradients $E_x$, $E_y$, and temporal brightness gradient $E_t$ are 
computed simply by using the first differences of image brightness values on a cubic 
grid; see fig. B-1. 

Using the indices $i$, $j$, and $k$ to represent $x$, $y$, and time $t$ respectively, the estimates 
of spatial gradients $E_x$ and $E_y$ are given by: 

$$E_x \approx \frac{1}{4\,\delta x}\Big(\big(E_{i+1,j,k} + E_{i+1,j,k+1} + E_{i+1,j+1,k} + E_{i+1,j+1,k+1}\big) - \big(E_{i,j,k} + E_{i,j,k+1} + E_{i,j+1,k} + E_{i,j+1,k+1}\big)\Big), \tag{B.1}$$

and 

$$E_y \approx \frac{1}{4\,\delta y}\Big(\big(E_{i,j+1,k} + E_{i,j+1,k+1} + E_{i+1,j+1,k} + E_{i+1,j+1,k+1}\big) - \big(E_{i,j,k} + E_{i,j,k+1} + E_{i+1,j,k} + E_{i+1,j,k+1}\big)\Big), \tag{B.2}$$

[Figure B-1; panels: 1st Image, 2nd Image] 

Figure B-1: The first brightness derivatives required in the direct methods can be 
estimated using first differences in a 2 x 2 x 2 cube of brightness values. The estimates 
apply to the point where four neighboring pixels in an image meet, and at a time 
halfway between two successive images. 

and the temporal gradient $E_t$ is 

$$E_t \approx \frac{1}{4\,\delta t}\Big(\big(E_{i,j,k+1} + E_{i,j+1,k+1} + E_{i+1,j,k+1} + E_{i+1,j+1,k+1}\big) - \big(E_{i,j,k} + E_{i,j+1,k} + E_{i+1,j,k} + E_{i+1,j+1,k}\big)\Big). \tag{B.3}$$



These formulations give the brightness gradients at a point lying between four neigh- 
boring pixels, and between successive images. 

Considering the fact that we perform spatial tessellation by using pixels and tem- 
poral tessellation by employing individual time varying frames, the above algorithms 
compensate for part of the tessellation errors involved in discrete digitized images. 
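
The first-difference estimates above can be sketched in a few lines of Python. This is only an illustrative sketch, not the C implementation used in this work: the function name, the `image[i][j]` layout (with `i` along x and `j` along y), and the default unit spacings are assumptions, and the factor 1/(4 δt) for the temporal gradient is assumed by analogy with the spatial estimates.

```python
def brightness_gradients(first, second, i, j, dx=1.0, dy=1.0, dt=1.0):
    """Estimate (E_x, E_y, E_t) from the 2 x 2 x 2 cube with corner (i, j).

    first and second are successive images indexed as image[i][j].
    The estimates apply to the center of the cube.
    """
    # cube[a][b][c] holds E at (i + a, j + b, k + c).
    cube = [[[img[i + a][j + b] for img in (first, second)]
             for b in (0, 1)] for a in (0, 1)]

    def face_sum(axis, value):
        """Sum the four cube entries whose index along `axis` equals `value`."""
        return sum(cube[a][b][c]
                   for a in (0, 1) for b in (0, 1) for c in (0, 1)
                   if (a, b, c)[axis] == value)

    E_x = (face_sum(0, 1) - face_sum(0, 0)) / (4 * dx)
    E_y = (face_sum(1, 1) - face_sum(1, 0)) / (4 * dy)
    E_t = (face_sum(2, 1) - face_sum(2, 0)) / (4 * dt)
    return E_x, E_y, E_t

# Usage on a brightness ramp E = 2x + 3y + 5t, whose gradients are known.
first = [[2.0 * i + 3.0 * j for j in range(3)] for i in range(3)]
second = [[2.0 * i + 3.0 * j + 5.0 for j in range(3)] for i in range(3)]
gradients = brightness_gradients(first, second, 1, 1)
```

Each gradient is the difference between the sums over two opposite faces of the cube, which is why all three estimates refer to the same point at the center of the cube.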



Appendix C: Depth at Fixation Point 



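The results in chapter 3 show that after obtaining the translation $\mathbf{t}$, we need to find 
$Z_0$ (depth at the fixation point) in order to estimate the depth $Z$ at any point $(x,y)$ in 
the image plane. This appendix introduces an algorithm for finding the depth $Z_0$. 
At the fixation point, eqn. 3.26 is exactly expanded to 

$$E_t + \omega_{R_0}\,\mathbf{v}\cdot\hat{\mathbf{R}}_0 + \left(\frac{1}{Z_0} - \frac{1}{Z_0}\right)(\mathbf{s}\cdot\mathbf{t}) = 0 \tag{C.1}$$

which is similar to eqn. 3.27. Theoretically, all terms of eqn. C.1 vanish because 
$E_t$ is zero at the fixation point, and $\mathbf{v}\cdot\mathbf{r} = 0$ applies to all points including the fixation 
point, which means $\mathbf{v}\cdot\hat{\mathbf{R}}_0 = \mathbf{v}\cdot\mathbf{r}_0 / \|\mathbf{r}_0\| = 0$. As a result, we cannot directly obtain the 
depth $Z_0$ from eqn. 3.26. However, at any point $i$ around the fixation point, depth 
$Z_{0i}$ can be obtained from eqn. 3.26 as 

$$Z_{0i} = \frac{1}{\|\mathbf{r}_0\|}\,\frac{\left(\mathbf{v}_i \times \mathbf{r}_0 - \|\mathbf{r}_0\|^2\,\mathbf{s}_i\right)\cdot\mathbf{t}}{E_{ti}\,\|\mathbf{r}_0\| + \omega_{R_0}\,(\mathbf{v}_i\cdot\mathbf{r}_0)} \tag{C.2}$$

By averaging $N$ of such neighboring depths, we can estimate the depth $Z_0$ as 

$$Z_0 = \frac{1}{N\,\|\mathbf{r}_0\|}\sum_{i=1}^{N}\frac{\left(\mathbf{v}_i \times \mathbf{r}_0 - \|\mathbf{r}_0\|^2\,\mathbf{s}_i\right)\cdot\mathbf{t}}{E_{ti}\,\|\mathbf{r}_0\| + \omega_{R_0}\,(\mathbf{v}_i\cdot\mathbf{r}_0)} \tag{C.3}$$

where $\mathbf{s}_i$, $\mathbf{v}_i$, and $E_{ti}$ are computed for $N$ points around the fixation point. In eqn. C.2, 
it is assumed that $Z_{0i} \approx Z_0$, which is valid considering the averaging in eqn. C.3. 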
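
The averaging step can be sketched in Python as follows. This is a hedged illustration, not the implementation used in this work. It assumes that the averaged expression reads Z0 = (1 / (N |r0|)) Σ ((v_i × r0 − |r0|² s_i) · t) / (E_ti |r0| + ω_R0 (v_i · r0)), where r0 is the image-plane vector of the fixation point and ω_R0 is the rotational velocity component about the fixation axis. The function names and the data layout are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(p * q for p, q in zip(a, b))

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def fixation_depth(r0, t, omega_R0, samples):
    """Average the per-point depth estimates around the fixation point.

    samples is a sequence of (v_i, s_i, E_ti) tuples, one per neighboring
    point (assumed layout). Points with a zero denominator are not handled.
    """
    norm_r0 = math.sqrt(dot(r0, r0))
    depths = []
    for v, s, E_t in samples:
        w = cross(v, r0)
        direction = tuple(w[k] - norm_r0 ** 2 * s[k] for k in range(3))
        numerator = dot(direction, t)
        denominator = norm_r0 * (E_t * norm_r0 + omega_R0 * dot(v, r0))
        depths.append(numerator / denominator)
    return sum(depths) / len(depths)
```

In practice, points where the denominator is close to zero would make the estimate unreliable, so a real implementation would probably skip or down-weight them; that choice is not specified here.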